Comments (36)
Yes, it's part of https://github.com/EESSI/software-layer . Your timing is pretty good, I very recently made a PR to our docs to explain how to use it to replicate build failures. PR isn't merged yet, but it's markdown, so you can simply view a rendered version in my feature branch. Links won't work in there, but I guess you can find your way around if need be - though I think this one markdown doc should cover it all.
from fftw3.
I followed this issue here from the EESSI repo. I'm trying to reproduce, but I haven't been able to do so. I've tried GCC 13.2.0 with Open MPI 4.1.6 and Open MPI 4.1.5. I'm running on an AWS hpc7g instance (Ubuntu 22.04). After being unable to reproduce directly from the FFTW source, I tried the following EasyBuild command:
eb -dfr --from-pr 18884 --prefix=/fsx/eb --disable-cleanup-builddir
which is based on trying to reproduce https://gist.github.com/boegel/d97b974b8780c93753f1bf7462367082.
After the build, I can run make check in the builddir, but none of the runs reproduce the crash. Do you have any other suggestions on how to reproduce?
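Since the failure is intermittent, a small wrapper that re-runs the checks many times and keeps a log of each failing iteration can help pin it down. A minimal sketch (the `run_repeatedly` helper and the `run-*.log` names are my own, not from the thread):

```shell
# Re-run a command N times, logging each iteration; report how many failed.
# run_repeatedly and the run-*.log filenames are hypothetical helpers.
run_repeatedly() {
  n=$1; shift
  fails=0
  for i in $(seq 1 "$n"); do
    if ! "$@" > "run-$i.log" 2>&1; then
      fails=$((fails + 1))
      echo "iteration $i failed (log: run-$i.log)"
    fi
  done
  echo "$fails of $n iterations failed"
}
# e.g.: run_repeatedly 20 make check
```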
One observation I have is that all the failures I've seen reported come from mpi-bench. It is true that mpirun may do slightly different things when it detects that it is running as part of a Slurm job. Can you provide any detail about how the Slurm job is allocated or launched?
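To check what mpirun could even detect, one can dump the Slurm-related environment from the exact shell the tests run in. A quick sketch (which variables Open MPI actually inspects is version-dependent, so this only shows what is available):

```shell
# Print any SLURM_* variables visible to the test shell; Open MPI uses
# these (among other hints) to decide it is inside a Slurm allocation.
env | grep '^SLURM_' || echo "no SLURM_* variables set"
```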
I'm not sure of the exact job characteristics for the test build reported in https://gist.github.com/boegel/d97b974b8780c93753f1bf7462367082
For the builds done in EESSI I also couldn't tell you exactly what resources were requested in the job. But: this is run in a container, and then in a shell in which the only Slurm-related job variable that is set is SLURM_JOB_ID. So, I'm not sure if there is much for mpirun to pick up on here to figure out it actually is in a Slurm environment... Of course, Slurm can do things like set cgroups etc., which potentially affect how things run, but I couldn't tell you if that is done on this cluster. All node allocations here are exclusive, so I don't think a cgroup would do much anyway (as it would encompass the entire VM).
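Whether the job actually sits in a restrictive cgroup can be checked directly from inside the session. A rough sketch (the exact paths are assumptions: the first is cgroup v1 with the cpuset controller, the second a common cgroup v2 location):

```shell
# Show the CPU set the current cgroup allows; fall back to nproc when no
# cgroup controller file is readable (paths differ between cgroup v1/v2).
cat /sys/fs/cgroup/cpuset/cpuset.cpus 2>/dev/null \
  || cat /sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null \
  || nproc
```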
I did notice that I had fewer failures when I did the building interactively (though still in a job environment, it was an interactive Slurm job), as mentioned here. That seems to confirm that the environment somehow has an effect, but... I couldn't really say what. This is a hard one :(
Hm, I suddenly realize one difference between our bot building for EESSI, and your typical interactive environment: the bot not only builds in the container, it builds in a writeable overlay in the container. That tends to be a bit sluggish in terms of I/O. I'm wondering if that can somehow affect how these tests run. It's a bit far-fetched, and I wouldn't be able to explain the mechanism that makes it fail, but it would explain why my own interactive attempts showed a much higher success rate.
Hm, I wonder how many CPUs were allocated to that container? I saw it was configured to allow oversubscription, so I guess there is only 1 CPU core, which is different from my testing...
Our build nodes in AWS have 16 cores (*.4xlarge instances); using a single core would be way too slow. Not sure what @casparvl used for testing interactively.
Is there a way for me to get access to that build container so I may try it myself?
Btw, I've tried to reproduce it once again, since we now have a new build cluster (based on Magic Castle instead of Cluster in the Cloud). I've only tried interactively (basically following the docs I just shared), and I cannot for the life of me replicate our own issue. As mentioned in the original issue, interactively I had much higher success rates (9/10 times, more or less), but I've run
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10 --mpi "mpirun -np 4 `pwd`/mpi-bench"
at least 20 times without failures now.
I'd love to see if the error still occurs when the bot builds it (as there it was consistently failing before), but my initial attempt failed for other reasons (basically, the bot cannot reinstall anything that already exists in the EESSI software stack; if you try, it fails when trying to change permissions on a read-only file). I'll check with others if there is something I can do to work around this, so that I can actually trigger a rebuild with the bot.
Yeah, I ran it over 200 times without failure on my cluster. Thank you for the pointers in that doc PR. I'll use that to try and trigger it again.
@casparvl Should I temporarily revive a node in our old CitC Slurm cluster, to check if the problem was somehow specific to that environment?
@casparvl I haven't had the time to reproduce within a container. Are we still seeing the test failures, or are they not happening on the newer build cluster?
I am still seeing this problem on our build cluster when doing a test installation (in an interactive session) of FFTW.MPI/3.3.10-gompi-2023a for the new EESSI repository software.eessi.io.
A first attempt resulted in a segfault:
[aarch64-neoverse-v1-node2:2475846] Signal: Segmentation fault (11)
[aarch64-neoverse-v1-node2:2475846] Signal code: (-6)
[aarch64-neoverse-v1-node2:2475846] Failing at address: 0xea670025c746
[aarch64-neoverse-v1-node2:2475846] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x4000042507a0]
[aarch64-neoverse-v1-node2:2475846] [ 1] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/libopen-pal.so.40(opal_convertor_generic_simple_position+0x10)[0x400004815b10]
[aarch64-neoverse-v1-node2:2475846] [ 2] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/libopen-pal.so.40(opal_convertor_set_position_nocheck+0x120)[0x40000480dc60]
[aarch64-neoverse-v1-node2:2475846] [ 3] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_request_progress_frag+0x360)[0x4000067f3be0]
[aarch64-neoverse-v1-node2:2475846] [ 4] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/openmpi/mca_btl_smcuda.so(mca_btl_smcuda_component_progress+0x424)[0x4000060b6ec4]
[aarch64-neoverse-v1-node2:2475846] [ 5] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/libopen-pal.so.40(opal_progress+0x3c)[0x4000047fc99c]
[aarch64-neoverse-v1-node2:2475846] [ 6] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x2e4)[0x4000067ec3a4]
[aarch64-neoverse-v1-node2:2475846] [ 7] /cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/neoverse_n1/software/OpenMPI/4.1.5-GCC-12.3.0/lib/libmpi.so.40(MPI_Sendrecv+0x188)[0x4000044a1228]
[aarch64-neoverse-v1-node2:2475846] [ 8] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/libfftw3f_mpi.so.3(+0xa424)[0x40000426a424]
[aarch64-neoverse-v1-node2:2475846] [ 9] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/libfftw3f_mpi.so.3(+0xa480)[0x40000426a480]
[aarch64-neoverse-v1-node2:2475846] [10] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/libfftw3f_mpi.so.3(+0xc2e8)[0x40000426c2e8]
[aarch64-neoverse-v1-node2:2475846] [11] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x40586c]
[aarch64-neoverse-v1-node2:2475846] [12] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x409144]
[aarch64-neoverse-v1-node2:2475846] [13] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x40aed4]
[aarch64-neoverse-v1-node2:2475846] [14] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x409434]
[aarch64-neoverse-v1-node2:2475846] [15] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x4083c0]
[aarch64-neoverse-v1-node2:2475846] [16] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x408430]
[aarch64-neoverse-v1-node2:2475846] [17] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x40619c]
[aarch64-neoverse-v1-node2:2475846] [18] /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6(+0x26a7c)[0x400004586a7c]
[aarch64-neoverse-v1-node2:2475846] [19] /cvmfs/software.eessi.io/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6(__libc_start_main+0x98)[0x400004586b4c]
[aarch64-neoverse-v1-node2:2475846] [20] /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/.libs/lt-mpi-bench[0x402ef0]
[aarch64-neoverse-v1-node2:2475846] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node aarch64-neoverse-v1-node2 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
FAILED mpirun -np 2 /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/mpi-bench: --verify 'ofc]12x11x4x7' --verify 'ifc]12x11x4x7' --verify 'ok7hx3o01x13e11' --verify 'ik7hx3o01x13e11' --verify 'obr6x2x8' --verify 'ibr6x2x8' --verify 'ofr6x2x8' --verify 'ifr6x2x8' --verify 'obc6x2x8' --verify 'ibc6x2x8' --verify 'ofc6x2x8' --verify 'ifc6x2x8' --verify 'ok]7o00x3o10' --verify 'ik]7o00x3o10' --verify 'ofr]5x6x12x10v1' --verify 'ifr]5x6x12x10v1' --verify 'obc]5x6x12x10v1' --verify 'ibc]5x6x12x10v1' --verify 'ofc]5x6x12x10v1' --verify 'ifc]5x6x12x10v1' --verify 'ok[3e11x13e11x9e10x9e00' --verify 'ik[3e11x13e11x9e10x9e00' --verify 'obr9x9' --verify 'ibr9x9' --verify 'ofr9x9' --verify 'ifr9x9' --verify 'obc9x9' --verify 'ibc9x9' --verify 'ofc9x9' --verify 'ifc9x9' --verify 'obrd11x24' --verify 'ibrd11x24' --verify 'ofrd11x24' --verify 'ifrd11x24' --verify 'obcd11x24' --verify 'ibcd11x24' --verify 'ofcd11x24' --verify 'ifcd11x24' --verify 'ok]8bx5o00x7o00x9e00' --verify 'ik]8bx5o00x7o00x9e00' --verify 'obc936' --verify 'ibc936' --verify 'ofc936' --verify 'ifc936'
make[3]: *** [Makefile:997: check-local] Error 1
make[3]: Leaving directory '/tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi'
A second attempt showed a relative error again:
--------------------------------------------------------------
MPI FFTW transforms passed 10 tests, 1 CPU
--------------------------------------------------------------
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10 --mpi "mpirun -np 2 `pwd`/mpi-bench"
obr[18x8x20 3.69362e-34 6.89042e-34 5.06167e-34
ibr[18x8x20 2.97137e-34 9.18722e-34 4.77233e-34
obc[18x8x20 3.25632e-34 5.74201e-34 7.15416e-34
ibc[18x8x20 3.65554e-34 6.89042e-34 6.48955e-34
ofc[18x8x20 3.19732e-34 5.74201e-34 6.1247e-34
ifc[18x8x20 3.37891e-34 5.74201e-34 8.06618e-34
ofr]3x13x9v6 2.8078e-34 3.83781e-34 8.48571e-34
ifr]3x13x9v6 2.98203e-34 3.83781e-34 7.858e-34
obc]3x13x9v6 3.76241e-34 5.75672e-34 7.44837e-34
Found relative error 5.965609e-01 (time shift)
0 4.921836295807 -1.026863218297 4.921836295807 -1.026863218297
1 -2.116798599036 2.941852671019 -2.116798599036 2.941852671019
2 1.771648568109 -0.438286594686 1.771648568109 -0.438286594686
3 -4.602865281776 5.484918179038 -4.602865281776 5.484918179038
4 2.819387086928 0.816109936207 2.819387086928 0.816109936207
5 -6.414801972466 5.098682093116 -6.414801972466 5.098682093116
6 -9.366154153178 -8.548834590260 -9.366154153178 -8.548834590260
7 -5.574288760734 4.865193642565 -5.574288760734 4.865193642565
8 -14.824759940401 5.035876685740 -14.824759940401 5.035876685740
9 -14.634486341483 1.743297726353 -14.634486341483 1.743297726353
10 -0.602576641742 -1.289954675842 -0.602576641742 -1.289954675842
11 3.105024350062 4.204442622313 3.105024350062 4.204442622313
12 9.179168922129 2.120173302885 9.179168922129 2.120173302885
13 -6.635818177405 -0.070873458827 -6.635818177405 -0.070873458827
14 -7.664759877207 -1.432949782471 -7.664759877207 -1.432949782471
15 2.136338652683 -3.130528874653 2.136338652683 -3.130528874653
16 -8.538098299824 5.591890241715 -8.538098299824 5.591890241715
17 -0.991271869558 -3.379819003153 -0.991271869558 -3.379819003153
18 -5.661447610867 7.529683912859 -5.661447610867 7.529683912859
19 -6.086167949355 -1.670238032124 -6.086167949355 -1.670238032124
20 -6.469068064222 7.690184841734 -6.469068064222 7.690184841734
21 1.465582751707 7.789424354982 1.465582751707 7.789424354982
22 -4.932751249830 -0.964580902292 -4.932751249830 -0.964580902292
23 -4.495483109168 3.138270992002 -4.495483109168 3.138270992002
24 -4.298238335069 -2.009670396150 -4.298238335069 -2.009670396150
25 -5.616046746225 -1.630859337171 -5.616046746225 -1.630859337171
26 -0.721988139199 -3.380289724460 -0.721988139199 -3.380289724460
27 6.817499183174 3.754929401943 6.817499183174 3.754929401943
28 -0.030920191076 4.600357644276 -0.030920191076 4.600357644276
29 0.839410098370 -1.908666344239 0.839410098370 -1.908666344239
30 3.385789170379 4.595090032781 3.385789170379 4.595090032781
31 4.379259724979 0.784057635193 4.379259724979 0.784057635193
32 11.841737046195 -6.148986050574 11.841737046195 -6.148986050574
33 -4.188145406309 4.506890617698 -4.188145406309 4.506890617698
34 2.987555638465 9.441583497205 2.987555638465 9.441583497205
35 -8.098881460448 -0.524743787520 -8.098881460448 -0.524743787520
36 -1.600749567878 -6.044191420031 -1.600749567878 -6.044191420031
37 0.163953738123 2.046467146682 0.163953738123 2.046467146682
38 -5.113238538613 -3.363399510184 -5.113238538613 -3.363399510184
39 -2.872798422536 -8.040245973957 -2.872798422536 -8.040245973957
40 -10.392736117901 -2.761172631260 -10.392736117901 -2.761172631260
41 4.039519771041 8.003816207053 4.039519771041 8.003816207053
42 1.790990423870 -8.383785422669 1.790990423870 -8.383785422669
43 -9.165783259172 -2.186625587455 -9.165783259172 -2.186625587455
44 5.007541953864 5.543722867012 5.007541953864 5.543722867012
45 2.365419650732 2.977801310135 2.365419650732 2.977801310135
46 -3.377120254702 4.906540019430 -3.377120254702 4.906540019430
47 -0.010783860068 4.273408211548 -0.010783860068 4.273408211548
48 -6.894392286266 6.830078049229 -6.894392286266 6.830078049229
49 -3.254264449347 6.744977714739 -3.254264449347 6.744977714739
50 -8.471641489793 5.603488318600 -8.471641489793 5.603488318600
51 0.084130029380 1.367262769771 0.084130029380 1.367262769771
52 1.482642504437 -4.602524328752 1.482642504437 -4.602524328752
53 0.788628072835 -9.891756192852 0.788628072835 -9.891756192852
54 -2.633046303010 11.214109607678 -2.633046303010 11.214109607678
55 -3.192499246401 3.363355364265 -3.192499246401 3.363355364265
56 -1.598444209258 3.573016880938 -1.598444209258 3.573016880938
57 5.522584641083 0.912730173997 5.522584641083 0.912730173997
58 -2.850571159892 -3.538531368267 -2.850571159892 -3.538531368267
59 0.289119554985 1.226480324376 0.289119554985 1.226480324376
60 -1.310174923968 -3.091891051678 -1.310174923968 -3.091891051678
61 -2.749495846212 -9.372017422996 -2.749495846212 -9.372017422996
62 3.279899011670 4.859168417630 3.279899011670 4.859168417630
63 2.379547285718 1.774931614389 2.379547285718 1.774931614389
64 4.662292029542 2.025644366541 4.662292029542 2.025644366541
65 -6.175223059442 -1.891888996868 -6.175223059442 -1.891888996868
66 1.731642745422 14.247081701735 1.731642745422 14.247081701735
67 -10.929576224104 -8.727780396180 -10.929576224104 -8.727780396180
68 5.844513943309 -1.235652769240 5.844513943309 -1.235652769240
69 4.853189951788 0.397500732336 4.853189951788 0.397500732336
70 1.645686104377 1.838816934461 1.645686104377 1.838816934461
71 -1.387808178933 -6.069222393915 -1.387808178933 -6.069222393915
72 -8.640352779734 7.623552803539 -8.640352779734 7.623552803539
73 -2.621092502218 6.557474990141 -2.621092502218 6.557474990141
74 -2.460425638794 0.126130793461 -2.460425638794 0.126130793461
75 -3.642105748754 -3.042790015208 -3.642105748754 -3.042790015208
76 0.903895069572 5.573680347688 0.903895069572 5.573680347688
77 -3.850746636008 -0.664540783961 -3.850746636008 -0.664540783961
78 2.670783169330 1.168453854800 2.670783169330 1.168453854800
79 0.863490161325 2.800910717379 0.863490161325 2.800910717379
80 -10.408734415051 -0.623237951468 -10.408734415051 -0.623237951468
81 -6.746215176255 -10.162136743830 -6.746215176255 -10.162136743830
82 6.010383700192 2.700168967362 6.010383700192 2.700168967362
83 7.250381313471 2.507195619411 7.250381313471 2.507195619411
84 5.728973913944 -2.066599007246 5.728973913944 -2.066599007246
85 -10.049824910825 5.688927229637 -10.049824910825 5.688927229637
86 2.592017899133 -1.850191728792 2.592017899133 -1.850191728792
87 10.779025866591 -1.076683736319 10.779025866591 -1.076683736319
88 -4.383388756630 1.650480826796 -4.383388756630 1.650480826796
89 -0.055685598972 -3.774783473873 -0.055685598972 -3.774783473873
90 6.628995072655 1.367150047102 6.628995072655 1.367150047102
91 -0.810232261568 -2.976939725877 -0.810232261568 -2.976939725877
92 -0.207344369538 -4.505328272435 -0.207344369538 -4.505328272435
93 5.262364487884 5.245089127649 5.262364487884 5.245089127649
94 -0.879545465455 -6.694733840184 -0.879545465455 -6.694733840184
95 -0.807449055017 -5.586509120899 -0.807449055017 -5.586509120899
96 4.706214482159 1.081938739490 4.706214482159 1.081938739490
97 -1.981403259786 7.529674456958 -1.981403259786 7.529674456958
98 2.203956996302 -4.983523613820 2.203956996302 -4.983523613820
99 -2.296628834421 2.179234813172 -2.296628834421 2.179234813172
100 9.173485452525 3.228133868069 9.173485452525 3.228133868069
101 -6.386943659435 6.926987789753 -6.386943659435 6.926987789753
102 3.076153055928 1.493617153748 3.076153055928 1.493617153748
103 10.054141435677 13.326661925432 10.054141435677 13.326661925432
104 8.463391787584 -5.877325613584 8.463391787584 -5.877325613584
105 -0.696625001947 -3.802301741098 -0.696625001947 -3.802301741098
106 -8.196977873692 -2.069536940407 -8.196977873692 -2.069536940407
107 2.948666032147 2.516823938344 2.948666032147 2.516823938344
108 -7.976790406507 -8.442930303150 -7.976790406507 -8.442930303150
109 -2.921418292350 0.328394194535 -2.921418292350 0.328394194535
110 2.105361692243 -1.048071016627 2.105361692243 -1.048071016627
111 -0.122956865261 -3.178104995804 -0.122956865261 -3.178104995804
112 1.377690789409 -1.577444205340 1.377690789409 -1.577444205340
113 -4.004584148861 -4.382890836537 -4.004584148861 -4.382890836537
114 0.011427712451 6.324099444670 0.011427712451 6.324099444670
115 5.826045729088 -14.340030576439 5.826045729088 -14.340030576439
116 7.577427495586 2.873642967239 7.577427495586 2.873642967239
117 -1.210172393913 3.087617904153 -1.210172393913 3.087617904153
118 4.129688769436 -0.269191081687 4.129688769436 -0.269191081687
119 2.498805623692 8.629698093887 2.498805623692 8.629698093887
120 0.180001563022 1.905778234978 0.180001563022 1.905778234978
121 7.007577095520 -8.896514053123 7.007577095520 -8.896514053123
122 6.566401034660 -3.159194023820 6.566401034660 -3.159194023820
123 -7.616361041524 -6.592271202720 -7.616361041524 -6.592271202720
124 -7.030945328309 -2.404710690963 -7.030945328309 -2.404710690963
125 -4.795666771461 7.565990037469 -4.795666771461 7.565990037469
126 -2.375104348185 1.918133142771 -2.375104348185 1.918133142771
127 4.793627396078 -11.569053139350 4.793627396078 -11.569053139350
128 0.825614651653 -5.877317639277 0.825614651653 -5.877317639277
129 6.404638041792 7.660923814373 6.404638041792 7.660923814373
130 -5.608845937279 6.189883435798 -5.608845937279 6.189883435798
131 -2.052132858903 -3.021527799608 -2.052132858903 -3.021527799608
132 -3.547036584342 -8.799408090505 -3.547036584342 -8.799408090505
133 -0.668169395838 0.242810562341 -0.668169395838 0.242810562341
134 6.968865621898 6.811579013049 6.968865621898 6.811579013049
135 -4.777970484256 4.042227001858 -4.777970484256 4.042227001858
136 -8.001526926080 -6.737608100204 -8.001526926080 -6.737608100204
137 -0.028276084813 2.602238255100 -0.028276084813 2.602238255100
138 -0.512308956568 -6.981404730691 -0.512308956568 -6.981404730691
139 9.387188053728 -13.669568093222 9.387188053728 -13.669568093222
140 4.861551650830 4.744755456001 4.861551650830 4.744755456001
141 -1.458785307411 4.663376600332 -1.458785307411 4.663376600332
142 -3.825099657621 -4.135819803719 -3.825099657621 -4.135819803719
143 7.337295848097 -5.254042209712 7.337295848097 -5.254042209712
144 -0.313260555864 6.771687218297 -0.313260555864 6.771687218297
145 -3.163795468737 8.593709314445 -3.163795468737 8.593709314445
146 -1.637608385520 -3.916686625097 -1.637608385520 -3.916686625097
147 2.893356680624 3.492613129989 2.893356680624 3.492613129989
148 -0.241462371122 7.603996304141 -0.241462371122 7.603996304141
149 4.792674968811 6.244544979428 4.792674968811 6.244544979428
150 -4.187404522818 3.480699468993 -4.187404522818 3.480699468993
151 2.275412058088 8.711606271295 2.275412058088 8.711606271295
152 9.309440618908 8.500678323888 9.309440618908 8.500678323888
153 -5.146801960557 0.480271780127 -5.146801960557 0.480271780127
154 -0.342934280885 4.006492082219 -0.342934280885 4.006492082219
155 -0.520225001067 -2.871435828872 -0.520225001067 -2.871435828872
156 -3.872971943304 1.447235114939 -3.872971943304 1.447235114939
157 -6.260170736857 -4.000013983045 -6.260170736857 -4.000013983045
158 -1.793247919295 4.904867267000 -1.793247919295 4.904867267000
159 -5.476491940734 2.221240632587 -5.476491940734 2.221240632587
160 -6.926551145538 4.990927999485 -6.926551145538 4.990927999485
161 8.742622092574 10.567091128674 8.742622092574 10.567091128674
162 1.485449402871 3.314914414563 1.485449402871 3.314914414563
163 -6.730468872131 -5.788026037934 -6.730468872131 -5.788026037934
164 -2.981885794878 -2.587215880092 -2.981885794878 -2.587215880092
165 1.447396351206 -12.814694105896 1.447396351206 -12.814694105896
166 -2.474143000457 -8.199676906604 -2.474143000457 -8.199676906604
167 -6.968727036826 6.621321359661 -6.968727036826 6.621321359661
168 -3.257964523801 0.484386452538 -3.257964523801 0.484386452538
169 2.319015390451 -1.703639037599 2.319015390451 -1.703639037599
170 -1.645353274574 11.946438535003 -1.645353274574 11.946438535003
171 -0.711343655735 -4.829312331723 -0.711343655735 -4.829312331723
172 -0.462013339680 3.395796127960 -0.462013339680 3.395796127960
173 -1.879403680530 -1.220043545876 -1.879403680530 -1.220043545876
174 1.907603137772 4.707561015705 1.907603137772 4.707561015705
175 4.690694650819 -3.134057632254 4.690694650819 -3.134057632254
176 -0.731397825734 10.216171123903 -0.731397825734 10.216171123903
177 1.727112370787 1.537556202680 1.727112370787 1.537556202680
178 9.804231130535 -3.050822838002 9.804231130535 -3.050822838002
179 -3.521642259704 8.644200067602 -3.521642259704 8.644200067602
180 1.847171292586 6.297594444781 1.847171292586 6.297594444781
181 -2.944580056826 -8.904668383923 -2.944580056826 -8.904668383923
182 -0.479878773202 9.252293550971 -0.479878773202 9.252293550971
183 8.105438096502 -0.100680885472 8.105438096502 -0.100680885472
184 -3.261705711112 5.625865138249 -3.261705711112 5.625865138249
185 -9.001340472449 3.481531232669 -9.001340472449 3.481531232669
186 -9.922858428321 8.928064172077 -9.922858428321 8.928064172077
187 -0.262071419781 -2.637186613527 -0.262071419781 -2.637186613527
188 13.976902439634 2.365843139075 13.976902439634 2.365843139075
189 -1.200796504034 -4.514856210494 -1.200796504034 -4.514856210494
190 10.243971312066 4.464830376249 10.243971312066 4.464830376249
191 -2.874371721067 -6.435933215796 -2.874371721067 -6.435933215796
192 1.013001932314 2.999699060836 1.013001932314 2.999699060836
193 -0.993840710862 -6.386582096375 -0.993840710862 -6.386582096375
194 3.561437964884 7.779957565555 3.561437964884 7.779957565555
195 9.312380923566 -6.079419786231 9.312380923566 -6.079419786231
196 0.417492073520 0.675369898888 0.417492073520 0.675369898888
197 -5.373267320387 -5.228378193047 -5.373267320387 -5.228378193047
198 2.811480320243 -1.530828750353 2.811480320243 -1.530828750353
199 -3.810636424898 -4.270965066499 -3.810636424898 -4.270965066499
200 -1.929116070223 -2.795097831046 -1.929116070223 -2.795097831046
201 -4.910461489544 3.949953732577 -4.910461489544 3.949953732577
202 -1.110838593410 0.859180227354 -1.110838593410 0.859180227354
203 -2.647010599309 10.090425689658 -2.647010599309 10.090425689658
204 0.055618930342 10.953225553089 0.055618930342 10.953225553089
205 7.677359001006 1.191345729669 7.677359001006 1.191345729669
206 -1.498323690654 0.356861042000 -1.498323690654 0.356861042000
207 2.097613279533 1.809878602708 2.097613279533 1.809878602708
208 -10.433389542881 -2.767883226818 -10.433389542881 -2.767883226818
209 4.485006007605 -2.861710075652 4.485006007605 -2.861710075652
210 -11.299061334429 -4.240819220427 -11.299061334429 -4.240819220427
211 0.889330359867 -6.122606788728 0.889330359867 -6.122606788728
212 1.644972522082 -4.130609805857 1.644972522082 -4.130609805857
213 3.119752911839 5.520336783880 3.119752911839 5.520336783880
214 6.451529263230 -6.991195712115 6.451529263230 -6.991195712115
215 1.950360060868 -5.530643072460 1.950360060868 -5.530643072460
216 6.040150340031 4.344206024582 6.040150340031 4.344206024582
217 1.752640417864 -13.456400163587 1.752640417864 -13.456400163587
218 13.891823564455 -4.615650120662 13.891823564455 -4.615650120662
219 3.353087607440 -5.568825085630 3.353087607440 -5.568825085630
220 -0.238755291286 -0.122203225111 -0.238755291286 -0.122203225111
221 -4.994487942039 -2.421765879674 -4.994487942039 -2.421765879674
222 -8.659719429484 2.921445084397 -8.659719429484 2.921445084397
223 1.322825765261 6.523414416538 1.322825765261 6.523414416538
224 0.383609830312 -11.798272908431 0.383609830312 -11.798272908431
225 -4.959900682847 -4.419719391506 -4.959900682847 -4.419719391506
226 -1.407603649500 -2.756941605224 -1.407603649500 -2.756941605224
227 -7.044264525785 0.083191244366 -7.044264525785 0.083191244366
228 5.027093393519 5.195264035163 5.027093393519 5.195264035163
229 1.563992212574 -0.701216248220 1.563992212574 -0.701216248220
230 0.306554234674 4.476987321667 0.306554234674 4.476987321667
231 1.226269348284 -2.296913229853 1.226269348284 -2.296913229853
232 -4.098996468141 -6.855165091528 -4.098996468141 -6.855165091528
233 -8.845292917687 -0.923422749681 -8.845292917687 -0.923422749681
234 -4.250287799692 3.557076157786 -4.250287799692 3.557076157786
235 0.469057774787 8.279657163755 0.469057774787 8.279657163755
236 4.340048272752 -0.232303117938 4.340048272752 -0.232303117938
237 1.752288340162 -4.554038855546 1.752288340162 -4.554038855546
238 -2.786461997863 0.349152549109 -2.786461997863 0.349152549109
239 -9.048613296502 4.902932369427 -9.048613296502 4.902932369427
240 1.292868067079 10.372646253328 1.292868067079 10.372646253328
251 6.531174970235 3.643867565208 6.531174970235 3.643867565208
252 -4.541992135905 0.814485798927 -4.541992135905 0.814485798927
253 10.547201095289 13.176243470534 10.547201095289 13.176243470534
254 1.680946638121 13.004362273317 1.680946638121 13.004362273317
255 7.244027605296 -1.038411963768 7.244027605296 -1.038411963768
262 -5.925648177938 -1.268203314222 -5.925648177938 -1.268203314222
267 1.353136474995 -2.122072383697 1.353136474995 -2.122072383697
268 3.486826915908 1.572000562698 3.486826915908 1.572000562698
269 -4.426443526401 2.623133044118 -4.426443526401 2.623133044118
270 8.632661103766 -11.643280600455 8.632661103766 -11.643280600455
271 3.232947482898 8.184951094877 3.232947482898 8.184951094877
272 -0.058647847283 -6.334711265114 -0.058647847283 -6.334711265114
273 -0.941586491285 8.349265532221 -0.941586491285 8.349265532221
274 -6.447305295794 -5.049955925120 -6.447305295794 -5.049955925120
275 -11.945598200550 -4.015966059585 -11.945598200550 -4.015966059585
276 0.362306726308 14.450774594960 0.362306726308 14.450774594960
277 -13.833931179943 -7.361791104432 -13.833931179943 -7.361791104432
278 0.768431424357 6.017412350709 0.768431424357 6.017412350709
279 0.030233743672 -2.307785598212 0.030233743672 -2.307785598212
280 10.014906601641 3.051751435802 10.014906601641 3.051751435802
281 5.112072879529 -4.132941863717 5.112072879529 -4.132941863717
282 -0.438617802708 10.276119662869 -0.438617802708 10.276119662869
283 -3.027137527217 -0.561076703303 -3.027137527217 -0.561076703303
284 3.926003026828 -4.086725429315 3.926003026828 -4.086725429315
285 0.786845785234 -1.530531963474 0.786845785234 -1.530531963474
286 -2.893235031611 -8.453773261229 -2.893235031611 -8.453773261229
287 11.596087554883 -4.013957133276 11.596087554883 -4.013957133276
288 -4.988489747276 11.688234628619 -4.988489747276 11.688234628619
289 -5.099846866775 3.149676053203 -5.099846866775 3.149676053203
290 3.993544832699 -2.176510514608 3.993544832699 -2.176510514608
291 1.791994775922 2.679198098395 1.791994775922 2.679198098395
292 6.229541027538 7.197596224506 6.229541027538 7.197596224506
293 -2.690450075242 6.678106532908 -2.690450075242 6.678106532908
294 7.028412388425 10.238169492735 7.028412388425 10.238169492735
295 -4.703505231104 -6.328634949054 -4.703505231104 -6.328634949054
296 -14.073077800312 -6.540533668748 -14.073077800312 -6.540533668748
297 -2.359761010290 4.669844938190 -2.359761010290 4.669844938190
298 -3.973951647153 -7.985259797914 -3.973951647153 -7.985259797914
299 4.741028202046 -0.901953828990 4.741028202046 -0.901953828990
FAILED mpirun -np 2 /tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi/mpi-bench: --verify 'obr[18x8x20' --verify 'ibr[18x8x20' --verify 'obc[18x8x20' --verify 'ibc[18x8x20' --verify 'ofc[18x8x20' --verify 'ifc[18x8x20' --verify 'ofr]3x13x9v6' --verify 'ifr]3x13x9v6' --verify 'obc]3x13x9v6' --verify 'ibc]3x13x9v6' --verify 'ofc]3x13x9v6' --verify 'ifc]3x13x9v6' --verify 'okd[10e00x7e01x4e00v11' --verify 'ikd[10e00x7e01x4e00v11' --verify 'obrd[10x11x3x10' --verify 'ibrd[10x11x3x10' --verify 'obcd[10x11x3x10' --verify 'ibcd[10x11x3x10' --verify 'ofcd[10x11x3x10' --verify 'ifcd[10x11x3x10' --verify 'okd]9o10x10e00x10e00x10b' --verify 'ikd]9o10x10e00x10e00x10b' --verify 'ofrd]3x12x6v6' --verify 'ifrd]3x12x6v6' --verify 'obcd]3x12x6v6' --verify 'ibcd]3x12x6v6' --verify 'ofcd]3x12x6v6' --verify 'ifcd]3x12x6v6' --verify 'okd6bx2e00v9' --verify 'ikd6bx2e00v9' --verify 'obr5x2x6v2' --verify 'ibr5x2x6v2' --verify 'ofr5x2x6v2' --verify 'ifr5x2x6v2' --verify 'obc5x2x6v2' --verify 'ibc5x2x6v2' --verify 'ofc5x2x6v2' --verify 'ifc5x2x6v2' --verify 'ofr]12x5x10v3' --verify 'ifr]12x5x10v3' --verify 'obc]12x5x10v3' --verify 'ibc]12x5x10v3' --verify 'ofc]12x5x10v3' --verify 'ifc]12x5x10v3'
make[3]: *** [Makefile:997: check-local] Error 1
make[3]: Leaving directory '/tmp/boegel/easybuild/build/FFTWMPI/3.3.10/gompi-2023a/fftw-3.3.10/mpi'
I tried to replicate this over the weekend. @casparvl's documentation was extremely helpful, thank you! I tried to debug this PR: https://github.com/EESSI/software-layer/pull/374/files
git clone https://github.com/EESSI/software-layer.git
cd software-layer
git remote add casparvl https://github.com/casparvl/software-layer
git fetch casparvl
git checkout casparvl/fftw_test
./eessi_container.sh --access rw --save /fsx/essi-fftw1
And then within the EasyBuild container I ran this in a loop:
eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot
It ran 374 times over the weekend without failure on an hpc7g.16xlarge (64 cores).
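For reference, such repeated runs can be driven by a simple retry loop along these lines (a sketch; the iteration count and log file names are hypothetical):

```shell
# Retry the EasyBuild run until it fails; keep a log per iteration.
# (Iteration count and log names are made up for illustration.)
for i in $(seq 1 374); do
  eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot > "run-$i.log" 2>&1 \
    || { echo "failed on iteration $i"; break; }
done
```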
@casparvl, it sounded like you suspected a writable overlay could cause more sluggish I/O. I'm not familiar enough with the EESSI container, but I think with `--access rw` I have done that, correct?
Do either of you have other ideas for me to change? I suppose I can switch to a c7g.4xlarge....
I was able to compile and successfully run on c7g.4xlarge as well, with no issues there either.
@casparvl Do you have other ideas on how I can try to reproduce? I'm not sure if it matters, but my attempt was on an Ubuntu 2004 and the container was started using: ./eessi_container.sh --access rw --save /fsx/lrbison/essi-fftw1
where the mount was hosted from FSx for Lustre file system.
My repeated testing was repeated calls of `eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot` rather than repeatedly starting the container.
Sorry for failing to come back to you on this. I'll try again myself as well. I just did one install, which indeed was successful. The second time, I ran into the same error @boegel hit the second time around:
MPI FFTW transforms passed 10 tests, 3 CPUs
--------------------------------------------------------------
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10 --mpi "mpirun -np 4 `pwd`/mpi-bench"
Executing "mpirun -np 4 /tmp/eessi-debug.n0muoZ0cuh/easybuild/build/FFTWMPI/3.3.10/gompi-2022a/fftw-3.3.10/mpi/mpi-bench --verbose=1 --verify 'ofc10x10x3' --verify 'ifc10x10x3' --verify 'ok]16bx11o11v6' --verify 'ik]16bx11o11v6' --verify 'ofr]12x13x8' --verify 'ifr]12x13x8' --verify 'obc]12x13x8' --verify 'ibc]12x13x8' --verify 'ofc]12x13x8' --verify 'ifc]12x13x8' --verify 'okd]12o01x30e00' --verify 'ikd]12o01x30e00' --verify 'ofrd]3x6x3x4' --verify 'ifrd]3x6x3x4' --verify 'obcd]3x6x3x4' --verify 'ibcd]3x6x3x4' --verify 'ofcd]3x6x3x4' --verify 'ifcd]3x6x3x4' --verify 'okd[8o11x9e10x10o00x10e01' --verify 'ikd[8o11x9e10x10o00x10e01' --verify 'obrd12x12x5v2' --verify 'ibrd12x12x5v2' --verify 'ofrd12x12x5v2' --verify 'ifrd12x12x5v2' --verify 'obcd12x12x5v2' --verify 'ibcd12x12x5v2' --verify 'ofcd12x12x5v2' --verify 'ifcd12x12x5v2' --verify 'ok[13e11x52o00' --verify 'ik[13e11x52o00' --verify 'obrd[8x7v2' --verify 'ibrd[8x7v2' --verify 'obcd[8x7v2' --verify 'ibcd[8x7v2' --verify 'ofcd[8x7v2' --verify 'ifcd[8x7v2' --verify 'obr12x3x2x8' --verify 'ibr12x3x2x8' --verify 'ofr12x3x2x8' --verify 'ifr12x3x2x8' --verify 'obc12x3x2x8' --verify 'ibc12x3x2x8' --verify 'ofc12x3x2x8' --verify 'ifc12x3x2x8'"
ofc10x10x3 1.95174e-07 3.30362e-07 1.86409e-07
ifc10x10x3 1.7346e-07 3.30362e-07 2.59827e-07
ok]16bx11o11v6 1.73834e-07 1.48147e-06 1.88905e-07
ik]16bx11o11v6 2.28489e-07 1.60348e-06 1.94972e-07
ofr]12x13x8 2.74646e-07 4.3193e-07 1.84938e-07
ifr]12x13x8 1.88937e-07 4.3193e-07 1.63803e-07
obc]12x13x8 2.10673e-07 4.3193e-07 2.28376e-07
ibc]12x13x8 1.97341e-07 4.3193e-07 2.27807e-07
ofc]12x13x8 2.19374e-07 5.39912e-07 2.17205e-07
ifc]12x13x8 2.08943e-07 4.3193e-07 2.19416e-07
okd]12o01x30e00 2.51417e-07 4.47886e-06 1.94862e-07
ikd]12o01x30e00 2.48254e-07 5.89166e-06 3.59064e-07
ofrd]3x6x3x4 1.82793e-07 2.59557e-07 1.48863e-07
ifrd]3x6x3x4 1.75387e-07 2.59557e-07 1.83453e-07
obcd]3x6x3x4 1.89722e-07 3.24447e-07 1.87965e-07
ibcd]3x6x3x4 1.94751e-07 3.24447e-07 1.69235e-07
ofcd]3x6x3x4 1.69961e-07 3.24447e-07 1.56861e-07
ifcd]3x6x3x4 1.82658e-07 3.24447e-07 1.69306e-07
Found relative error 2.900030e+35 (time shift)
0 -164.457138061523 -164.457199096680
1 -225.637115478516 -225.637100219727
2 -20.902750015259 -20.902732849121
3 172.414703369141 172.414733886719
4 -4.662590026855 -4.662593841553
5 7.010725498199 7.010738372803
6 -89.267349243164 -89.267364501953
7 326.806823730469 326.806823730469
8 -19.448410034180 -19.448524475098
9 69.001441955566 69.001434326172
10 -104.643005371094 -104.643020629883
11 -26.874126434326 -26.874076843262
12 -24.399785995483 -24.399751663208
13 -141.903198242188 -141.903198242188
14 -90.872367858887 -90.872360229492
15 -44.611225128174 -44.611217498779
16 41.871009826660 41.871009826660
17 176.062194824219 176.062194824219
18 -90.186141967773 -90.186141967773
19 -4.998665332794 -4.998687744141
Running it a third time, it completed successfully again.
The only thing you don't mention explicitly is whether you also followed the steps of activating the prefix environment & EESSI pilot stack, as described on https://www.eessi.io/docs/adding_software/debugging_failed_builds/ , and whether you sourced the `configure_easybuild` script. Did you do that?
If you didn't, I guess that means you've built the full software stack from the ground up. If that's the case, and if that works, then I guess the conclusion is that something is fishy with one of the FFTW.MPI dependencies we pick up from the EESSI pilot stack (and for which you would have done a fresh build). That's useful information, because it would show that using the dependencies from EESSI somehow triggers this issue. Also, it'd mean you could actually try those steps as well (i.e. start the prefix environment, start the EESSI pilot stack, source the `configure_easybuild` script), and see if you can replicate the issue that way. That would unambiguously prove that the issue is somewhere in the dependencies that we already have in the stack.
Just for reference, this is a snippet of my `history` from the point I started the container to having run the `eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot` command once:
1 EESSI_CVMFS_REPO=/cvmfs/pilot.eessi-hpc.org/
2 EESSI_PILOT_VERSION=2023.06
3 source ${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/init/bash
4 export WORKDIR=$(mktemp --directory --tmpdir=/tmp -t eessi-debug.XXXXXXXXXX)
5 source configure_easybuild
6 module load EasyBuild/4.8.1
7 eb --show-config
8 eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot
The result of `eb --show-config` is:
[EESSI pilot 2023.06] $ eb --show-config
#
# Current EasyBuild configuration
# (C: command line argument, D: default value, E: environment variable, F: configuration file)
#
buildpath (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/build
containerpath (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/containers
debug (E) = True
experimental (E) = True
filter-deps (E) = Autoconf, Automake, Autotools, binutils, bzip2, DBus, flex, gettext, gperf, help2man, intltool, libreadline, libtool, Lua, M4, makeinfo, ncurses, util-linux, XZ, zlib, Yasm
filter-env-vars (E) = LD_LIBRARY_PATH
hooks (E) = /home/casparvl/debug_PR374/software-layer/eb_hooks.py
ignore-osdeps (E) = True
installpath (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_n1/testing
module-extensions (E) = True
packagepath (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/packages
prefix (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild
read-only-installdir (E) = True
repositorypath (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/ebfiles_repo
robot-paths (D) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/4.8.1/easybuild/easyconfigs
rpath (E) = True
sourcepath (E) = /tmp/eessi-debug.n0muoZ0cuh/easybuild/sources:
sysroot (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64
trace (E) = True
zip-logs (E) = bzip2
Curious to hear if you ran using the EESSI pilot stack for dependencies. Maybe you can also share your `eb --show-config` output.
I'm also still puzzled by the randomness of this issue. I'd love to better understand why the failures of these tests are random. Is the input randomly generated? Is the algorithm simply non-deterministic (e.g. because of a non-deterministic order in reduction operations, or something of that nature)? I'd love to understand whether that 'randomness' could somehow be affected by the environment, as initially I seem to have seen many more failures in a job environment than interactively... But I'm not sure if any of you has such an intricate knowledge of what these particular tests do :)
Yes, I'm afraid I can't speak for the fftw developers here; perhaps @matteo-frigo could help answer the question about what `../tests/check.pl` is checking, and whether the failures are catastrophic or simply small precision errors?
My complete steps are here:
git clone https://github.com/EESSI/software-layer.git
cd software-layer
git remote add casparvl https://github.com/casparvl/software-layer
git fetch casparvl
git checkout casparvl/fftw_test
./eessi_container.sh --access rw --save /fsx/lrbison/essi-fftw1
Apptainer> echo ${EESSI_CVMFS_REPO}; echo ${EESSI_PILOT_VERSION}
/cvmfs/pilot.eessi-hpc.org
2023.06
export EESSI_OS_TYPE=linux # We only support Linux for now
export EESSI_CPU_FAMILY=$(uname -m)
${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/compat/${EESSI_OS_TYPE}/${EESSI_CPU_FAMILY}/startprefix
#...(wait a bit)
export EESSI_CVMFS_REPO=/cvmfs/pilot.eessi-hpc.org
export EESSI_PILOT_VERSION=2023.06
source ${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/init/bash
export WORKDIR=/tmp/try1
source configure_easybuild
module load EasyBuild/4.8.1
eb --show-config
eb --easystack eessi-2023.06-eb-4.8.1-2022a.yml --robot
Sadly I didn't save my easybuild output; let me re-create it again. I am curious: when you "retry", do you retry from `eb --easystack ...` or from `./eessi_container.sh ...`?
Ok, so you also built on top of the dependencies that were already provided from the EESSI side. Then I really don't see any differences, other than (potentially) things in the environment... Strange!
> I am curious, when you "retry" do you retry from `eb --easystack...` or do you retry from `./eessi_container.sh ...`?

Like you, I retried from `eb --easystack ...`. So I get different results, even without restarting the container...
Also interesting: I've tried a 4th time. Now I get a hanging process, i.e. I see two `lt-mpi-bench` processes using ~100% CPU, and they have been doing so for 66 minutes straight. They normally complete much faster. MPI deadlock...?
I would love a backtrace of both of those processes!
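For the next time the hang shows up, capturing both backtraces non-interactively could look like this (a sketch assuming `gdb` and `pgrep` are available; `lt-mpi-bench` is the process name seen in this thread, and attaching may require ptrace permissions):

```shell
# Dump a full backtrace of every hung rank, without an interactive session.
# 'lt-mpi-bench' is the libtool-wrapped test binary named in this thread;
# attaching may need sudo or kernel.yama.ptrace_scope=0.
for pid in $(pgrep lt-mpi-bench); do
    echo "=== backtrace of PID $pid ==="
    gdb -p "$pid" -batch -ex 'thread apply all bt full'
done
```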
Great idea... but unfortunately my allocation ended 2 minutes after I noticed the hang :( I'm pretty sure I had process hangs before as well, when I ran into this issue originally. I'll try to run it a couple more times tonight, see if I can trigger it again and get a backtrace...
Hm, while trying to reproduce my hang (which I haven't succeeded in yet), I noticed something: the automatic initialization script from EESSI thinks this node is a `neoverse_n1`. I seem to remember some chatter about this architecture not being detected properly, but thought we fixed that - maybe not. Anyway, it will build against dependencies optimized for `neoverse_n1`. I'm pretty sure our build bot overrides this automatic CPU architecture detection, but maybe @boegel can confirm... It would at least point to one difference between what our bot does and what I get interactively.
Anyway, for now, I'll override it myself with `export EESSI_SOFTWARE_SUBDIR_OVERRIDE=aarch64/neoverse_v1` before sourcing the `init` script. See where that takes me in terms of build failures, hangs, etc.
Interesting: now that I use the right dependencies (thanks to `export EESSI_SOFTWARE_SUBDIR_OVERRIDE=aarch64/neoverse_v1`), the failures are suddenly consistent, instead of occasional. Maybe you could give that a try as well: set it after running `startprefix`, but before sourcing the initialization script. Also, at this point, you may unset `EESSI_SILENT`. That will cause the init script to print which architecture is selected (it should respect your override, but it's good to check).
I've run it about 10-15 times now. Each time, it fails with a numerical error like the one above. Now, finally, I've managed to reproduce the hanging 2 processes. Here's the backtrace:
(gdb) bt full
#0 0x000040002c61c604 in opal_timer_linux_get_cycles_sys_timer ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libopen-pal.so.40
No symbol table info available.
#1 0x000040002c5ccaec in opal_progress_events.isra ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libopen-pal.so.40
No symbol table info available.
#2 0x000040002c5ccc88 in opal_progress () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libopen-pal.so.40
No symbol table info available.
#3 0x000040002c22babc in ompi_request_default_wait () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#4 0x000040002c27e284 in ompi_coll_base_sendrecv_actual ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#5 0x000040002c27f40c in ompi_coll_base_allreduce_intra_recursivedoubling ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#6 0x000040002c27fad4 in ompi_coll_base_allreduce_intra_ring ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#7 0x000040002ea861cc in ompi_coll_tuned_allreduce_intra_dec_fixed ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/openmpi/mca_coll_tuned.so
No symbol table info available.
#8 0x000040002c23b4e8 in PMPI_Allreduce () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/OpenMPI/4.1.4-GCC-11.3.0/lib/libmpi.so.40
No symbol table info available.
#9 0x000040002c0161d0 in fftwf_mpi_any_true ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#10 0x000040002c067648 in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#11 0x000040002c06781c in fftwf_mkplan_d ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#12 0x000040002c01ef0c in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#13 0x000040002c0670e8 in search0 () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#14 0x000040002c0673a4 in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#15 0x000040002c06781c in fftwf_mkplan_d ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#16 0x000040002c01e49c in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#17 0x000040002c0670e8 in search0 () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#18 0x000040002c0673a4 in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#19 0x000040002c0e83ac in mkplan () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#20 0x000040002c0e85a0 in fftwf_mkapiplan ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f.so.3
No symbol table info available.
#21 0x000040002c017aac in fftwf_mpi_plan_guru_r2r ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#22 0x000040002c017bcc in fftwf_mpi_plan_many_r2r ()
from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3f_mpi.so.3
No symbol table info available.
#23 0x0000000000404928 in mkplan ()
No symbol table info available.
#24 0x0000000000405778 in setup ()
No symbol table info available.
#25 0x00000000004085e0 in verify ()
No symbol table info available.
#26 0x0000000000406498 in bench_main ()
No symbol table info available.
#27 0x000040002c346a7c in ?? () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6
No symbol table info available.
#28 0x000040002c346b4c in __libc_start_main () from /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64/lib/../lib64/libc.so.6
No symbol table info available.
#29 0x0000000000402f30 in _start ()
No symbol table info available.
> Hm, while trying to reproduce my hang (which I didn't succeed in yet), I noticed something: the automatic initialization script from EESSI thinks this node is a `neoverse_n1`. I seem to remember some chatter about this architecture not being detected properly, but thought we fixed that - maybe not. Anyway, it will build against dependencies optimized on `neoverse_n1`. I'm pretty sure our build bot overrides this automatic CPU architecture detection, but maybe @boegel can confirm... It would at least point to one difference between what our bot does, and what I get interactively.
Our bot indeed overrides the CPU auto-detection during building, because `archspec` is sometimes a bit too pedantic (see for example archspec/archspec-json#38). In `software.eessi.io` we've switched to our own pure bash `archdetect` mechanism, which is less pedantic, but that's not used during build either: the build bot just sets `$EESSI_SOFTWARE_SUBDIR_OVERRIDE` based on its configuration.
Seems like we (you) are making progress! I tried to add your override. Here is my eb config:
buildpath (E) = /tmp/try1/easybuild/build
containerpath (E) = /tmp/try1/easybuild/containers
debug (E) = True
experimental (E) = True
filter-deps (E) = Autoconf, Automake, Autotools, binutils, bzip2, DBus, flex, gettext, gperf, help2man, intltool, libreadline, libtool, Lua, M4, makeinfo, ncurses, util-linux, XZ, zlib, Yasm
filter-env-vars (E) = LD_LIBRARY_PATH
hooks (E) = /tmp/software-layer/eb_hooks.py
ignore-osdeps (E) = True
installpath (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/testing
module-extensions (E) = True
packagepath (E) = /tmp/try1/easybuild/packages
prefix (E) = /tmp/try1/easybuild
read-only-installdir (E) = True
repositorypath (E) = /tmp/try1/easybuild/ebfiles_repo
robot-paths (D) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/EasyBuild/4.8.1/easybuild/easyconfigs
rpath (E) = True
sourcepath (E) = /tmp/try1/easybuild/sources:
sysroot (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64
trace (E) = True
zip-logs (E) = bzip2
But I still don't get failures during testing.
I do think allreduce has the potential to be non-deterministic; however, I'm unsure whether the `ompi_coll_base_allreduce_intra_ring` implementation is or isn't deterministic.
I wonder, is there a way for me to continually run the test without rebuilding each time?
It is possible. What you could do is stop the EasyBuild installation after a certain point using the `--stop` argument. You can do that by editing the yaml file to make it look like this at the end:
- FFTW.MPI-3.3.10-gompi-2022a.eb:
options:
rebuild: True
stop: 'build'
This should stop it after the build step (and before the test step). Then, you'd want to run `eb FFTW.MPI-3.3.10-gompi-2022a.eb --dump-env-script`. This will dump a script `FFTW.MPI-3.3.10-gompi-2022a.env` that you can source to get the same environment that EasyBuild has during the build. Then, check one of your prior builds (done before you added the 'stop' in the yaml file) to see what command was executed by EasyBuild as its test step, and in which directory. The logs are pretty verbose, so it may be a bit of a puzzle to find, but at least they show all that information.
Then, source that `FFTW.MPI-3.3.10-gompi-2022a.env`, and go to the directory in which EasyBuild normally runs its test step (or an equivalent dir: your tempdir might differ between your stopped build and the prior build you inspected the logs for, so the prefix might look a little different) and run the command that EasyBuild ran as its test step. That last command you should be able to put in a loop.
By the way, your `installpath` from `eb --show-config` shows that you are indeed using the `neoverse_v1` copy of the software stack (which should be the case since you use the override), so that's good.
I'm absolutely puzzled by why things are different for you than for us. Short of seeing if we could have you test things on our cluster, I don't know what else to try for you to reproduce the failure... :/ If that's something you would be up for, see if you can reach out to @boegel on the EESSI Slack in a DM (join here if you're not yet on that channel); he might be able to arrange it for you.
@boegel maybe you could also do the reverse: spin up a regular VM outside of our Magic Castle setup and see if you can reproduce the issue there? If not, it must be related to our cluster setup somehow...
Also a heads up: I'm going to be on quite a long leave, so won't be able to respond for the next month or so. Again, maybe @boegel can follow up if needed :)
Thank you for the testing insight and the slack invite. Enjoy the break. I'll talk to @boegel on slack and see what he thinks is a reasonable next step.
@lrbison When would you like to follow up on this?
I talked offline with Kenneth.
In the meantime, my pattern-matching neurons fired:
both #334 (comment) and https://gitlab.com/eessi/support/-/issues/24#note_1734228961 have something in common:
Both are in `mca_btl_smcuda_component_progress` from the smcuda module, but I recall smcuda should really only be engaged when CUDA/ROCm/{accelerator} memory is used; otherwise we should be using the SM BTL. I'll follow up on that.
Another similarity is that although the fftw backtrace is just from a sendrecv, the hang was stopped during allreduce, and both the OpenFOAM and FFTW cases were doing `ompi_coll_base_allreduce_intra_recursivedoubling`. However, my gut tells me it's not the reduction at fault but rather the progress engine (partially because I know for a fact we are testing that allreduce function daily without issue).
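One way to test that suspicion would be to exclude the smcuda BTL and re-run the failing case (`^` is Open MPI's standard MCA exclusion syntax; the mpi-bench path and `--verify` argument below are illustrative, taken from the failing commands earlier in the thread):

```shell
# Re-run a failing case with the smcuda BTL ruled out; if the hang/failure
# disappears, smcuda is implicated. Use the real failing command from the log.
if command -v mpirun > /dev/null; then
    mpirun --mca btl ^smcuda -np 2 ./mpi-bench --verify 'obr[18x8x20'
else
    echo "mpirun not available on this machine"
fi
```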
Moving the rest of this discussion to https://gitlab.com/eessi/support/-/issues/41
The root cause was open-mpi/ompi#12270, fixed in open-mpi/ompi#12338, so this issue can be closed.
For Neoverse V1 users, if you can also try and report on the release-for-testing in #315 it would be useful to get SVE support upstream.
Closing as requested.