pnnl / proxytsprd
Proxy application for analyzing dynamical systems.
License: BSD 3-Clause "New" or "Revised" License
Describe the current behavior
The program deadlocks: some processes wait indefinitely for one or more other ranks. We launch it with mpirun.
Describe the expected behavior
Since all processes monitor GPU behavior and process a similar amount of data, the load should be balanced across processes and no deadlock should occur.
Standalone code to reproduce the issue
Not yet developed
Other info / logs
[2021-11-18 21:35:46.256559: W /tmp/pip-install-2ki5kaqo/horovod_610732e1ed6541bc8eb95d6e0fc227fd/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [allreduce.noname.1]
2: [allreduce.noname.1]
[2021-11-18 21:36:46.257429: W /tmp/pip-install-2ki5kaqo/horovod_610732e1ed6541bc8eb95d6e0fc227fd/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [allreduce.noname.1]
2: [allreduce.noname.1]
[2021-11-18 21:37:46.257757: W /tmp/pip-install-2ki5kaqo/horovod_610732e1ed6541bc8eb95d6e0fc227fd/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [allreduce.noname.1]
2: [allreduce.noname.1]
[2021-11-18 21:38:46.257948: W /tmp/pip-install-2ki5kaqo/horovod_610732e1ed6541bc8eb95d6e0fc227fd/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [allreduce.noname.1]
2: [allreduce.noname.1]
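The warning above says that some ranks submitted an allreduce that ranks 0 and 2 never matched. The issue does not establish the cause. One frequent cause, assumed here purely for illustration, is that ranks run a different number of training steps because their data shards produce different batch counts. The sketch below uses hypothetical helper names (`batches_per_rank`, `common_steps`) and made-up sizes to show how such a mismatch arises and how capping every rank at the minimum step count would avoid it.

```python
from typing import List


def batches_per_rank(num_samples: int, world_size: int, batch_size: int) -> List[int]:
    """Batches each rank sees when samples are split as evenly as possible."""
    base, extra = divmod(num_samples, world_size)
    counts = []
    for rank in range(world_size):
        shard = base + (1 if rank < extra else 0)  # first `extra` ranks get one more sample
        counts.append(-(-shard // batch_size))     # ceiling division: last batch may be partial
    return counts


def common_steps(counts: List[int]) -> int:
    """Steps every rank can run so that each collective call is matched by all ranks."""
    return min(counts)


# Illustrative numbers only, not taken from the issue.
counts = batches_per_rank(num_samples=1001, world_size=4, batch_size=250)
steps = common_steps(counts)
print("batches per rank:", counts)
print("steps every rank should run:", steps)
```

With these numbers one rank would attempt a second allreduce that the other ranks never submit, which is the kind of situation the stall inspector reports.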
Horovod or DDP should be initialized at the very start of the program so that, from the first line of output onward, only one process prints.
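A minimal sketch of rank-aware printing, assuming the launcher exports the rank in an environment variable (Open MPI's mpirun sets `OMPI_COMM_WORLD_RANK`, MPICH-style launchers set `PMI_RANK`, and torchrun sets `RANK`). The helpers `detect_rank` and `print_rank0` are hypothetical names, not part of ProxyTSPRD. Once Horovod or DDP is initialized, `hvd.rank()` or `torch.distributed.get_rank()` could supply the same value.

```python
import os


def detect_rank(environ=None) -> int:
    """Return this process's rank from launcher variables, or 0 if none is set."""
    env = os.environ if environ is None else environ
    for key in ("OMPI_COMM_WORLD_RANK", "PMI_RANK", "RANK"):
        value = env.get(key)
        if value is not None:
            return int(value)
    return 0  # single-process run: behave like rank 0


def print_rank0(*args, environ=None, **kwargs):
    """Print only on rank 0 so multi-process runs do not duplicate output."""
    if detect_rank(environ) == 0:
        print(*args, **kwargs)


print_rank0("starting run")
```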
Printing is fine for debugging, but for production-quality development we should move to Python's logging module.
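One possible shape for that move, sketched with the standard `logging` module: every rank gets its own logger tagged with its rank, and non-zero ranks only emit warnings and errors. The function name `make_logger`, the logger name `proxytsprd`, and the level policy are assumptions for illustration.

```python
import logging


def make_logger(rank: int, name: str = "proxytsprd") -> logging.Logger:
    """Create a per-rank logger; rank 0 logs INFO and above, other ranks WARNING and above."""
    logger = logging.getLogger(f"{name}.rank{rank}")
    logger.setLevel(logging.INFO if rank == 0 else logging.WARNING)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter(f"%(asctime)s [rank {rank}] %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep messages out of the root logger
    return logger


log0 = make_logger(0)
log1 = make_logger(1)
log0.info("visible: rank 0 reports progress")
log1.info("suppressed: non-zero ranks stay quiet at INFO")
log1.warning("visible: any rank can report problems")
```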
Currently, ProxyTSPRD only supports power systems data. Next, we need to add the climate data.
When using the fp16 data type, training produces NaN errors. The PyTorch forum suggests using automatic mixed precision instead of fp16: https://discuss.pytorch.org/t/how-to-avoid-nan-loss-when-using-fp16-training/151665.
Given that it requires data conversion to fp32, enabling fp16 will be a future enhancement for ProxyTSPRD. For now, the supported data types are fp32, fp64, and amp (automatic mixed precision).
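To illustrate why pure fp16 is fragile, the sketch below emulates half-precision rounding with the standard library's `struct` format `e`. It shows a value above the fp16 maximum (65504) turning into infinity, which then yields NaN, and a small gradient underflowing to zero unless it is scaled first. Loss scaling of this kind is one of the mechanisms mixed-precision training relies on. The values and the scale factor are illustrative assumptions, and this is not PyTorch's implementation.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses values beyond the fp16 range; hardware would produce infinity.
        return math.copysign(math.inf, x)


# Overflow: anything above 65504 becomes inf, and inf - inf is NaN.
overflowed = to_fp16(70000.0)
nan_result = overflowed - overflowed

# Underflow: a small gradient vanishes in fp16 ...
grad = 1e-8
naive = to_fp16(grad)

# ... but survives if scaled up before the cast and scaled back down in higher precision.
scale = 2.0 ** 16  # illustrative scale factor
recovered = to_fp16(grad * scale) / scale

print("70000 in fp16:", overflowed, "| inf - inf:", nan_result)
print("1e-8 in fp16:", naive, "| with scaling:", recovered)
```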
The TensorFlow version is not yet fully implemented. It should run with Horovod, not with DDP.
Describe the current behavior
Unexpected error when running TensorFlow code on all the GPUs of a node.
Describe the expected behavior
The code should run without an error.
Standalone code to reproduce the issue
https://github.com/pnnl/ProxyTSPRD/blob/master/examples/test_tfhorovod.py
Other info / logs
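The log that follows reports `Check failed: ret == 0 (11 vs. 0)` from `pthread_create()`. On Linux, error code 11 is EAGAIN, which `pthread_create` returns when a resource or thread limit prevents creating another thread. The issue does not say which limit was hit. As a first diagnostic, and only as a sketch, the hypothetical helper below reports the per-user process/thread limit (`RLIMIT_NPROC`) and the CPU count visible to the process. Other limits (for example system-wide or cgroup limits) could equally be responsible.

```python
import errno
import os
import resource


def thread_limit_report() -> dict:
    """Collect limits that commonly explain EAGAIN from pthread_create on Linux."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "errno_11_name": errno.errorcode.get(11),  # 'EAGAIN' on Linux
        "nproc_soft": soft,   # per-user limit on processes/threads (-1 means unlimited)
        "nproc_hard": hard,
        "cpu_count": os.cpu_count(),
    }


report = thread_limit_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If the soft limit is low relative to the number of ranks per node times the threads each rank creates, raising it or reducing the per-process thread count would be worth trying.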
2021-11-19 12:32:23.909021: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
[a100-02:216667] *** Process received signal ***
[a100-02:216667] Signal: Aborted (6)
[a100-02:216667] Signal code: (-6)
2021-11-19 12:32:23.911684: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-11-19 12:32:23.912414: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2245910000 Hz
2021-11-19 12:32:23.912408: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-11-19 12:32:23.913016: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2245910000 Hz
2021-11-19 12:32:23.922883: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
[a100-02:216666] *** Process received signal ***
[a100-02:216666] Signal: Aborted (6)
[a100-02:216666] Signal code: (-6)
2021-11-19 12:32:23.922979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37524 MB memory) -> physical GPU (device: 2, name: A100-SXM4-40GB, pci bus id: 0000:47:00.0, compute capability: 8.0)
2021-11-19 12:32:23.923407: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
[a100-02:216662] *** Process received signal ***
[a100-02:216662] Signal: Aborted (6)
[a100-02:216662] Signal code: (-6)
2021-11-19 12:32:23.923954: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
[a100-02:216660] *** Process received signal ***
[a100-02:216660] Signal: Aborted (6)
[a100-02:216660] Signal code: (-6)
2021-11-19 12:32:23.943334: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
[a100-02:216665] *** Process received signal ***
[a100-02:216665] Signal: Aborted (6)
[a100-02:216665] Signal code: (-6)
2021-11-19 12:32:23.952735: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
[a100-02:216663] *** Process received signal ***
[a100-02:216663] Signal: Aborted (6)
[a100-02:216663] Signal code: (-6)
[a100-02:216662] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2acd96a6b630]
[a100-02:216662] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2acd96cae387]
[a100-02:216662] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2acd96cafa78]
[a100-02:216662] [ 3] [a100-02:216663] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2b7edde6f630]
[a100-02:216663] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7ede0b2387]
[a100-02:216663] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7ede0b3a78]
[a100-02:216663] [ 3] [a100-02:216665] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2b3db30cc630]
[a100-02:216665] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3db330f387]
[a100-02:216665] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3db3310a78]
[a100-02:216665] [ 3] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2b80fb885704]
[a100-02:216663] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2b3fd0ae2704]
[a100-02:216665] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2acfb4481704]
[a100-02:216662] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2b8114372a88]
[a100-02:216663] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2b3fe95cfa88]
[a100-02:216665] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2acfccf6ea88]
[a100-02:216662] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2b8114372b01]
[a100-02:216663] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2b3fe95cfb01]
[a100-02:216665] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2acfccf6eb01]
[a100-02:216662] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEEC1EibS3_+0x72d)[0x2b80f22f69dd]
[a100-02:216663] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEEC1EibS3_+0x72d)[0x2b3fc75539dd]
[a100-02:216665] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEEC1EibS3_+0x72d)[0x2acfaaef29dd]
[a100-02:216662] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKNS_13ThreadOptionsERKSsibPN5Eigen9AllocatorE+0xff)[0x2b80f22f828f]
[a100-02:216663] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKNS_13ThreadOptionsERKSsibPN5Eigen9AllocatorE+0xff)[0x2b3fc755528f]
[a100-02:216665] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKNS_13ThreadOptionsERKSsibPN5Eigen9AllocatorE+0xff)[0x2acfaaef428f]
[a100-02:216662] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKSsi+0x39)[0x2b80f22f8989]
[a100-02:216663] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKSsi+0x39)[0x2b3fc7555989]
[a100-02:216665] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKSsi+0x39)[0x2acfaaef4989]
[a100-02:216662] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN15stream_executor14StreamExecutorC1EPKNS_8PlatformESt10unique_ptrINS_8internal23StreamExecutorInterfaceESt14default_deleteIS6_EEi+0xba)[0x2b80fb57a89a]
[a100-02:216663] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN15stream_executor14StreamExecutorC1EPKNS_8PlatformESt10unique_ptrINS_8internal23StreamExecutorInterfaceESt14default_deleteIS6_EEi+0xba)[0x2b3fd07d789a]
[a100-02:216665] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgrC1EPN15stream_executor14StreamExecutorERKNS_10GPUOptionsE+0x157)[0x2acfb42b97f7]
[a100-02:216662] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudaPlatform19GetUncachedExecutorERKNS_20StreamExecutorConfigE+0x1e5)[0x2b8114858945]
[a100-02:216663] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudaPlatform19GetUncachedExecutorERKNS_20StreamExecutorConfigE+0x1e5)[0x2b3fe9ab5945]
[a100-02:216665] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x163f5ac)[0x2b81148575ac]
[a100-02:216663] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x163f5ac)[0x2b3fe9ab45ac]
[a100-02:216665] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow15EventMgrFactory11GetEventMgrEPN15stream_executor14StreamExecutorERKNS_10GPUOptionsE+0x80)[0x2acfb42b98e0]
[a100-02:216662] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice4InitERKNS_14SessionOptionsE+0x230)[0x2acfccdc9c30]
[a100-02:216662] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN15stream_executor13ExecutorCache11GetOrCreateERKNS_20StreamExecutorConfigERKSt8functionIFNS_4port8StatusOrISt10unique_ptrINS_14StreamExecutorESt14default_deleteIS8_EEEEvEE+0x337)[0x2b80fb5834d7]
[a100-02:216663] [13] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN15stream_executor13ExecutorCache11GetOrCreateERKNS_20StreamExecutorConfigERKSt8functionIFNS_4port8StatusOrISt10unique_ptrINS_14StreamExecutorESt14default_deleteIS8_EEEEvEE+0x337)[0x2b3fd07e04d7]
[a100-02:216665] [13] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudaPlatform11GetExecutorERKNS_20StreamExecutorConfigE+0x4b)[0x2b811485766b]
[a100-02:216663] [14] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory15CreateGPUDeviceERKNS_14SessionOptionsERKSsNS_3gtl7IntTypeINS_12TfGpuId_tag_EiEExRKNS_14DeviceLocalityEPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteISF_EESaISI_EE+0x60b)[0x2acfccdcf20b]
[a100-02:216662] [13] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudaPlatform11GetExecutorERKNS_20StreamExecutorConfigE+0x4b)[0x2b3fe9ab466b]
[a100-02:216665] [14] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudaPlatform17ExecutorForDeviceEi+0x8f)[0x2b8114859aaf]
[a100-02:216663] [15] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory13CreateDevicesERKNS_14SessionOptionsERKSsPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteIS8_EESaISB_EE+0x2ac4)[0x2acfccdd3354]
[a100-02:216662] [14] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudaPlatform17ExecutorForDeviceEi+0x8f)[0x2b3fe9ab6aaf]
[a100-02:216665] [15] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory19GetInterconnectMapsERKSt6vectorINS_3gtl7IntTypeINS_18PlatformGpuId_tag_EiEESaIS5_EEPN15stream_executor8PlatformEPS1_INS0_15InterconnectMapESaISD_EE+0xc4)[0x2b81141cca24]
[a100-02:216663] [16] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13DeviceFactory10AddDevicesERKNS_14SessionOptionsERKSsPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteIS8_EESaISB_EE+0xed)[0x2acfccba0f4d]
[a100-02:216662] [15] [a100-02:216664] [ 0] [a100-02:216667] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2b9cbbfd3630]
[a100-02:216664] [ 1] /usr/lib64/libpthread.so.0(+0xf630)[0x2ab3579bd630]
[a100-02:216667] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b9cbc216387]
[a100-02:216664] [ 2] /usr/lib64/libc.so.6(gsignal+0x37)[0x2ab357c00387]
[a100-02:216667] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b9cbc217a78]
[a100-02:216664] [ 3] /usr/lib64/libc.so.6(abort+0x148)[0x2ab357c01a78]
[a100-02:216667] [ 3] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory19GetInterconnectMapsERKSt6vectorINS_3gtl7IntTypeINS_18PlatformGpuId_tag_EiEESaIS5_EEPN15stream_executor8PlatformEPS1_INS0_15InterconnectMapESaISD_EE+0xc4)[0x2b3fe9429a24]
[a100-02:216665] [16] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory13CreateDevicesERKNS_14SessionOptionsERKSsPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteIS8_EESaISB_EE+0x2d2)[0x2b81141d4b62]
[a100-02:216663] [17] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_NewContext+0x97)[0x2acfaaea7a17]
[a100-02:216662] [16] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x38d25)[0x2acfce3ead25]
[a100-02:216662] [17] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2acfce3e86b2]
[a100-02:216662] [18] python(+0x13c7ae)[0x55dedc5b97ae]
[a100-02:216662] [19] python(_PyObject_MakeTpCall+0x3bf)[0x55dedc5ae25f]
[a100-02:216662] [20] python(_PyEval_EvalFrameDefault+0x5437)[0x55dedc657e87]
[a100-02:216662] [21] python(_PyFunction_Vectorcall+0x1b7)[0x55dedc64a3d7]
[a100-02:216662] [22] python(_PyEval_EvalFrameDefault+0x4bf)[0x55dedc652f0f]
[a100-02:216662] [23] python(_PyFunction_Vectorcall+0x1b7)[0x55dedc64a3d7]
[a100-02:216662] [24] python(_PyEval_EvalFrameDefault+0x4f81)[0x55dedc6579d1]
[a100-02:216662] [25] python(_PyEval_EvalCodeWithName+0xd5f)[0x55dedc649cef]
[a100-02:216662] [26] python(_PyFunction_Vectorcall+0x594)[0x55dedc64a7b4]
[a100-02:216662] [27] python(+0x1b9418)[0x55dedc636418]
[a100-02:216662] [28] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory13CreateDevicesERKNS_14SessionOptionsERKSsPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteIS8_EESaISB_EE+0x2d2)[0x2b3fe9431b62]
[a100-02:216665] [17] python(_PyObject_MakeTpCall+0x228)[0x55dedc5ae0c8]
[a100-02:216662] [29] python(_PyEval_EvalFrameDefault+0x4ef0)[0x55dedc657940]
[a100-02:216662] *** End of error message ***
/people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13DeviceFactory10AddDevicesERKNS_14SessionOptionsERKSsPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteIS8_EESaISB_EE+0xed)[0x2b8113fa4f4d]
[a100-02:216663] [18] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2ab5753d3704]
[a100-02:216667] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2b9ed99e9704]
[a100-02:216664] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13DeviceFactory10AddDevicesERKNS_14SessionOptionsERKSsPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteIS8_EESaISB_EE+0xed)[0x2b3fe9201f4d]
[a100-02:216665] [18] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2b9ef24d6a88]
[a100-02:216664] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2b9ef24d6b01]
[a100-02:216664] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_NewContext+0x97)[0x2b80f22aba17]
[a100-02:216663] [19] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x38d25)[0x2b81157eed25]
[a100-02:216663] [20] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2b81157ec6b2]
[a100-02:216663] [21] python(+0x13c7ae)[0x55d4e2a387ae]
[a100-02:216663] [22] python(_PyObject_MakeTpCall+0x3bf)[0x55d4e2a2d25f]
[a100-02:216663] [23] python(_PyEval_EvalFrameDefault+0x5437)[0x55d4e2ad6e87]
[a100-02:216663] [24] python(_PyFunction_Vectorcall+0x1b7)[0x55d4e2ac93d7]
[a100-02:216663] [25] python(_PyEval_EvalFrameDefault+0x4bf)[0x55d4e2ad1f0f]
[a100-02:216663] [26] python(_PyFunction_Vectorcall+0x1b7)[0x55d4e2ac93d7]
[a100-02:216663] [27] python(_PyEval_EvalFrameDefault+0x4f81)[0x55d4e2ad69d1]
[a100-02:216663] [28] python(_PyEval_EvalCodeWithName+0xd5f)[0x55d4e2ac8cef]
[a100-02:216663] [29] python(_PyFunction_Vectorcall+0x594)[0x55d4e2ac97b4]
[a100-02:216663] *** End of error message ***
/people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2ab58dec0a88]
[a100-02:216667] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_NewContext+0x97)[0x2b3fc7508a17]
[a100-02:216665] [19] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x38d25)[0x2b3feaa4bd25]
[a100-02:216665] [20] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2b3feaa496b2]
[a100-02:216665] [21] python(+0x13c7ae)[0x562ea73757ae]
[a100-02:216665] [22] python(_PyObject_MakeTpCall+0x3bf)[0x562ea736a25f]
[a100-02:216665] [23] python(_PyEval_EvalFrameDefault+0x5437)[0x562ea7413e87]
[a100-02:216665] [24] python(_PyFunction_Vectorcall+0x1b7)[0x562ea74063d7]
[a100-02:216665] [25] python(_PyEval_EvalFrameDefault+0x4bf)[0x562ea740ef0f]
[a100-02:216665] [26] python(_PyFunction_Vectorcall+0x1b7)[0x562ea74063d7]
[a100-02:216665] [27] python(_PyEval_EvalFrameDefault+0x4f81)[0x562ea74139d1]
[a100-02:216665] [28] python(_PyEval_EvalCodeWithName+0xd5f)[0x562ea7405cef]
[a100-02:216665] [29] python(_PyFunction_Vectorcall+0x594)[0x562ea74067b4]
[a100-02:216665] *** End of error message ***
/people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEEC1EibS3_+0x72d)[0x2b9ed045a9dd]
[a100-02:216664] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2ab58dec0b01]
[a100-02:216667] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKNS_13ThreadOptionsERKSsibPN5Eigen9AllocatorE+0xff)[0x2b9ed045c28f]
[a100-02:216664] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow31NewThreadPoolFromSessionOptionsERKNS_14SessionOptionsE+0xcb)[0x2b9ef2497c3b]
[a100-02:216664] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEEC1EibS3_+0x72d)[0x2ab56be449dd]
[a100-02:216667] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow12EagerContextC2ERKNS_14SessionOptionsENS_28ContextDevicePlacementPolicyEbbPKNS_9DeviceMgrEbPNS_10RendezvousEPNS_33DistributedFunctionLibraryRuntimeE+0x297)[0x2b9ed61bac07]
[a100-02:216664] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKNS_13ThreadOptionsERKSsibPN5Eigen9AllocatorE+0xff)[0x2ab56be4628f]
[a100-02:216667] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow31NewThreadPoolFromSessionOptionsERKNS_14SessionOptionsE+0xcb)[0x2ab58de81c3b]
[a100-02:216667] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_NewContext+0x24e)[0x2b9ed040fbce]
[a100-02:216664] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x38d25)[0x2b9ef3952d25]
[a100-02:216664] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2b9ef39506b2]
[a100-02:216664] [13] python(+0x13c7ae)[0x5617cccb47ae]
[a100-02:216664] [14] python(_PyObject_MakeTpCall+0x3bf)[0x5617ccca925f]
[a100-02:216664] [15] python(_PyEval_EvalFrameDefault+0x5437)[0x5617ccd52e87]
[a100-02:216664] [16] python(_PyFunction_Vectorcall+0x1b7)[0x5617ccd453d7]
[a100-02:216664] [17] python(_PyEval_EvalFrameDefault+0x4bf)[0x5617ccd4df0f]
[a100-02:216664] [18] python(_PyFunction_Vectorcall+0x1b7)[0x5617ccd453d7]
[a100-02:216664] [19] python(_PyEval_EvalFrameDefault+0x4f81)[0x5617ccd529d1]
[a100-02:216664] [20] python(_PyEval_EvalCodeWithName+0xd5f)[0x5617ccd44cef]
[a100-02:216664] [21] python(_PyFunction_Vectorcall+0x594)[0x5617ccd457b4]
[a100-02:216664] [22] python(+0x1b9418)[0x5617ccd31418]
[a100-02:216664] [23] python(_PyObject_MakeTpCall+0x228)[0x5617ccca90c8]
[a100-02:216664] [24] python(_PyEval_EvalFrameDefault+0x4ef0)[0x5617ccd52940]
[a100-02:216664] [25] python(_PyFunction_Vectorcall+0x1b7)[0x5617ccd453d7]
[a100-02:216664] [26] python(+0x1b9318)[0x5617ccd31318]
[a100-02:216664] [27] python(_PyObject_MakeTpCall+0x228)[0x5617ccca90c8]
[a100-02:216664] [28] python(_PyEval_EvalFrameDefault+0x4ef0)[0x5617ccd52940]
[a100-02:216664] [29] python(_PyEval_EvalCodeWithName+0x260)[0x5617ccd441f0]
[a100-02:216664] *** End of error message ***
/people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow12EagerContextC2ERKNS_14SessionOptionsENS_28ContextDevicePlacementPolicyEbbPKNS_9DeviceMgrEbPNS_10RendezvousEPNS_33DistributedFunctionLibraryRuntimeE+0x297)[0x2ab571ba4c07]
[a100-02:216667] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_NewContext+0x24e)[0x2ab56bdf9bce]
[a100-02:216667] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x38d25)[0x2ab58f33cd25]
[a100-02:216667] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2ab58f33a6b2]
[a100-02:216667] [13] python(+0x13c7ae)[0x5653fa3c57ae]
[a100-02:216667] [14] python(_PyObject_MakeTpCall+0x3bf)[0x5653fa3ba25f]
[a100-02:216667] [15] python(_PyEval_EvalFrameDefault+0x5437)[0x5653fa463e87]
[a100-02:216667] [16] python(_PyFunction_Vectorcall+0x1b7)[0x5653fa4563d7]
[a100-02:216667] [17] python(_PyEval_EvalFrameDefault+0x4bf)[0x5653fa45ef0f]
[a100-02:216667] [18] python(_PyFunction_Vectorcall+0x1b7)[0x5653fa4563d7]
[a100-02:216667] [19] python(_PyEval_EvalFrameDefault+0x4f81)[0x5653fa4639d1]
[a100-02:216667] [20] python(_PyEval_EvalCodeWithName+0xd5f)[0x5653fa455cef]
[a100-02:216667] [21] python(_PyFunction_Vectorcall+0x594)[0x5653fa4567b4]
[a100-02:216667] [22] python(+0x1b9418)[0x5653fa442418]
[a100-02:216667] [23] python(_PyObject_MakeTpCall+0x228)[0x5653fa3ba0c8]
[a100-02:216667] [24] python(_PyEval_EvalFrameDefault+0x4ef0)[0x5653fa463940]
[a100-02:216667] [25] python(_PyFunction_Vectorcall+0x1b7)[0x5653fa4563d7]
[a100-02:216667] [26] python(+0x1b9318)[0x5653fa442318]
[a100-02:216667] [27] python(_PyObject_MakeTpCall+0x228)[0x5653fa3ba0c8]
[a100-02:216667] [28] python(_PyEval_EvalFrameDefault+0x4ef0)[0x5653fa463940]
[a100-02:216667] [29] python(_PyEval_EvalCodeWithName+0x260)[0x5653fa4551f0]
[a100-02:216667] *** End of error message ***
[a100-02:216666] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2b4a269ca630]
[a100-02:216666] [ 1] [a100-02:216660] [ 0] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b4a26c0d387]
[a100-02:216666] [ 2] /usr/lib64/libpthread.so.0(+0xf630)[0x2b0b8e4f1630]
[a100-02:216660] [ 1] /usr/lib64/libc.so.6(abort+0x148)[0x2b4a26c0ea78]
[a100-02:216666] [ 3] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b0b8e734387]
[a100-02:216660] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b0b8e735a78]
[a100-02:216660] [ 3] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2b4c443e0704]
[a100-02:216666] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2b0dabf07704]
[a100-02:216660] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2b4c5cecda88]
[a100-02:216666] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2b4c5cecdb01]
[a100-02:216666] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow18AsyncSingletonImpl25StartInitializationThreadEPNS_24LoggerSingletonContainerE+0x32a)[0x2b4c5cebdf2a]
[a100-02:216666] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN4absl14lts_2020_02_2513base_internal12CallOnceImplIRFvPN10tensorflow24LoggerSingletonContainerEEJRS5_EEEvPSt6atomicIjENS1_14SchedulingModeEOT_DpOT0_+0x2a)[0x2b4c5cebe2ba]
[a100-02:216666] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow6Logger17GetSingletonAsyncEv+0x60)[0x2b4c5cebe370]
[a100-02:216666] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2b0dc49f4a88]
[a100-02:216660] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2b0dc49f4b01]
[a100-02:216660] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow18AsyncSingletonImpl25StartInitializationThreadEPNS_24LoggerSingletonContainerE+0x32a)[0x2b0dc49e4f2a]
[a100-02:216660] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN4absl14lts_2020_02_2513base_internal12CallOnceImplIRFvPN10tensorflow24LoggerSingletonContainerEEJRS5_EEEvPSt6atomicIjENS1_14SchedulingModeEOT_DpOT0_+0x2a)[0x2b0dc49e52ba]
[a100-02:216660] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow6Logger17GetSingletonAsyncEv+0x60)[0x2b0dc49e5370]
[a100-02:216660] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x9085a7f)[0x2b4c40bcba7f]
[a100-02:216666] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x9085a7f)[0x2b0da86f2a7f]
[a100-02:216660] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow20BroadcastXlaActivityENS_25XlaAutoClusteringActivityE+0x57)[0x2b4c41734587]
[a100-02:216666] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow20BroadcastXlaActivityENS_25XlaAutoClusteringActivityE+0x57)[0x2b0da925b587]
[a100-02:216660] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow24ReportClusteringInfoPass3RunERKNS_28GraphOptimizationPassOptionsE+0x93)[0x2b4c40c57a53]
[a100-02:216666] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow24OptimizationPassRegistry11RunGroupingENS0_8GroupingERKNS_28GraphOptimizationPassOptionsE+0x1a6)[0x2b4c5cdb2b06]
[a100-02:216666] [13] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow29ProcessFunctionLibraryRuntime22InstantiateMultiDeviceERKSsNS_9AttrSliceERKNS_22FunctionLibraryRuntime18InstantiateOptionsEPy+0x13c5)[0x2b4c5cd938c5]
[a100-02:216666] [14] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow29ProcessFunctionLibraryRuntime11InstantiateERKSsNS_9AttrSliceERKNS_22FunctionLibraryRuntime18InstantiateOptionsEPy+0xc3)[0x2b4c5cd96053]
[a100-02:216666] [15] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow19KernelAndDeviceFunc15InstantiateFuncERKNS_15KernelAndDevice7ContextERKNS_7NodeDefEPNS_14GraphCollectorE+0x1042)[0x2b4c40bc0702]
[a100-02:216666] [16] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow24ReportClusteringInfoPass3RunERKNS_28GraphOptimizationPassOptionsE+0x93)[0x2b0da877ea53]
[a100-02:216660] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow24OptimizationPassRegistry11RunGroupingENS0_8GroupingERKNS_28GraphOptimizationPassOptionsE+0x1a6)[0x2b0dc48d9b06]
[a100-02:216660] [13] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow29ProcessFunctionLibraryRuntime22InstantiateMultiDeviceERKSsNS_9AttrSliceERKNS_22FunctionLibraryRuntime18InstantiateOptionsEPy+0x13c5)[0x2b0dc48ba8c5]
[a100-02:216660] [14] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow19KernelAndDeviceFunc4InitERKNS_15KernelAndDevice7ContextERKNS_7NodeDefEPNS_14GraphCollectorE+0x1f)[0x2b4c40bc165f]
[a100-02:216666] [17] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x902eaf1)[0x2b4c40b74af1]
[a100-02:216666] [18] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow29ProcessFunctionLibraryRuntime11InstantiateERKSsNS_9AttrSliceERKNS_22FunctionLibraryRuntime18InstantiateOptionsEPy+0xc3)[0x2b0dc48bd053]
[a100-02:216660] [15] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x902fc95)[0x2b4c40b75c95]
[a100-02:216666] [19] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow19KernelAndDeviceFunc15InstantiateFuncERKNS_15KernelAndDevice7ContextERKNS_7NodeDefEPNS_14GraphCollectorE+0x1042)[0x2b0da86e7702]
[a100-02:216660] [16] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow12EagerExecuteEPNS_14EagerOperationEPPNS_12TensorHandleEPi+0x180)[0x2b4c40b775c0]
[a100-02:216666] [20] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow19KernelAndDeviceFunc4InitERKNS_15KernelAndDevice7ContextERKNS_7NodeDefEPNS_14GraphCollectorE+0x1f)[0x2b0da86e865f]
[a100-02:216660] [17] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow14EagerOperation7ExecuteEN4absl14lts_2020_02_254SpanIPNS_20AbstractTensorHandleEEEPi+0x18c)[0x2b4c40b61dbc]
[a100-02:216666] [21] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x902eaf1)[0x2b0da869baf1]
[a100-02:216660] [18] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_Execute+0x26)[0x2b4c3ae07a56]
[a100-02:216666] [22] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_Z24TFE_Py_ExecuteCancelableP11TFE_ContextPKcS2_PN4absl14lts_2020_02_2513InlinedVectorIP16TFE_TensorHandleLm4ESaIS7_EEEP7_objectP23TFE_CancellationManagerPNS5_IS7_Lm2ES8_EEP9TF_Status+0x4b5)[0x2b4c3ad811a5]
[a100-02:216666] [23] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x2f557)[0x2b4c5e340557]
[a100-02:216666] [24] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x3101b)[0x2b4c5e34201b]
[a100-02:216666] [25] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2b4c5e3476b2]
[a100-02:216666] [26] python(+0x13c7ae)[0x558ff4bcf7ae]
[a100-02:216666] [27] python(_PyObject_MakeTpCall+0x3bf)[0x558ff4bc425f]
[a100-02:216666] [28] python(_PyEval_EvalFrameDefault+0x5437)[0x558ff4c6de87]
[a100-02:216666] [29] python(_PyEval_EvalCodeWithName+0x260)[0x558ff4c5f1f0]
[a100-02:216666] *** End of error message ***
/people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x902fc95)[0x2b0da869cc95]
[a100-02:216660] [19] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow12EagerExecuteEPNS_14EagerOperationEPPNS_12TensorHandleEPi+0x180)[0x2b0da869e5c0]
[a100-02:216660] [20] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow14EagerOperation7ExecuteEN4absl14lts_2020_02_254SpanIPNS_20AbstractTensorHandleEEEPi+0x18c)[0x2b0da8688dbc]
[a100-02:216660] [21] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_Execute+0x26)[0x2b0da292ea56]
[a100-02:216660] [22] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_Z24TFE_Py_ExecuteCancelableP11TFE_ContextPKcS2_PN4absl14lts_2020_02_2513InlinedVectorIP16TFE_TensorHandleLm4ESaIS7_EEEP7_objectP23TFE_CancellationManagerPNS5_IS7_Lm2ES8_EEP9TF_Status+0x4b5)[0x2b0da28a81a5]
[a100-02:216660] [23] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x2f557)[0x2b0dc5e67557]
[a100-02:216660] [24] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x3101b)[0x2b0dc5e6901b]
[a100-02:216660] [25] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2b0dc5e6e6b2]
[a100-02:216660] [26] python(+0x13c7ae)[0x55674eb8b7ae]
[a100-02:216660] [27] python(_PyObject_MakeTpCall+0x3bf)[0x55674eb8025f]
[a100-02:216660] [28] python(_PyEval_EvalFrameDefault+0x5437)[0x55674ec29e87]
[a100-02:216660] [29] python(_PyEval_EvalCodeWithName+0x260)[0x55674ec1b1f0]
[a100-02:216660] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 216667 on node a100-02 exited on signal 6 (Aborted).
--------------------------------------------------------------------------