proxytsprd's People

Contributors

devcentral-pnnl, jainmilan, sg0

proxytsprd's Issues

NVIDIA Nsight Systems (nsys) with Horovod (launched via mpirun) results in a deadlock

  • TensorFlow version: 2.4.1
  • PyTorch version: 1.9.0
  • Python version: 3.8.4
  • GCC/Compiler version: 5.2.0
  • OpenMPI version: 4.1.0
  • CUDA/cuDNN version: 11.0
  • GPU model and memory: A100

Describe the current behavior
The program deadlocks: processes keep waiting for one or more ranks that never join the collective operation. The job is launched with mpirun.

Describe the expected behavior
Since all processes monitor GPU behavior and process a similar amount of data, the load should be balanced across processes and no deadlock should occur.

Standalone code to reproduce the issue
Not yet developed

Other info / logs

[2021-11-18 21:35:46.256559: W /tmp/pip-install-2ki5kaqo/horovod_610732e1ed6541bc8eb95d6e0fc227fd/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [allreduce.noname.1]
2: [allreduce.noname.1]
[2021-11-18 21:36:46.257429: W /tmp/pip-install-2ki5kaqo/horovod_610732e1ed6541bc8eb95d6e0fc227fd/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [allreduce.noname.1]
2: [allreduce.noname.1]
[2021-11-18 21:37:46.257757: W /tmp/pip-install-2ki5kaqo/horovod_610732e1ed6541bc8eb95d6e0fc227fd/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [allreduce.noname.1]
2: [allreduce.noname.1]
[2021-11-18 21:38:46.257948: W /tmp/pip-install-2ki5kaqo/horovod_610732e1ed6541bc8eb95d6e0fc227fd/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [allreduce.noname.1]
2: [allreduce.noname.1]
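
The warning above says a stall occurs when only a subset of ranks submits a tensor. As a rough analogy only (this is not Horovod code), the sketch below models a collective with a `threading.Barrier`: every "rank" must arrive before any can proceed. The function name `run_ranks`, the rank count of 4, and the timeout are hypothetical choices for illustration; the skipped ranks 0 and 2 mirror the "Missing ranks" in the log.

```python
import threading


def run_ranks(num_ranks, skipping_ranks, timeout=0.5):
    """Simulate one collective call; return the ranks that stalled waiting."""
    barrier = threading.Barrier(num_ranks)  # stands in for an allreduce
    stalled = []
    lock = threading.Lock()

    def worker(rank):
        if rank in skipping_ranks:
            return  # this rank never submits its tensor
        try:
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            # Raised when the wait times out or another waiter timed out.
            with lock:
                stalled.append(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(stalled)


if __name__ == "__main__":
    print(run_ranks(4, skipping_ranks={0, 2}))  # ranks 1 and 3 stall
    print(run_ranks(4, skipping_ranks=set()))   # nobody stalls
```

In a real job there is no timeout that releases the waiting ranks, which is why the stall inspector keeps repeating the warning every 60 seconds.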

Printing from multiple processes

Horovod or DDP should be initialized at the very start of the program so that only one process prints from the first line of output onward.
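
A minimal sketch of rank-aware printing, assuming the rank can be read from the launcher's environment: Open MPI's mpirun sets `OMPI_COMM_WORLD_RANK`, and torchrun sets `RANK`. The helper names `get_rank` and `print_rank0` are hypothetical. In a Horovod program the rank would more usually come from `hvd.rank()` after calling `hvd.init()`; the environment lookup here only keeps the example free of external dependencies.

```python
import os


def get_rank(environ=None):
    """Return this process's rank, or 0 when no launcher variable is set."""
    env = os.environ if environ is None else environ
    # OMPI_COMM_WORLD_RANK: set by Open MPI's mpirun; RANK: set by torchrun.
    for name in ("OMPI_COMM_WORLD_RANK", "RANK"):
        value = env.get(name)
        if value is not None:
            return int(value)
    return 0


def print_rank0(*args, environ=None, **kwargs):
    """Print only on rank 0; return True if something was printed."""
    if get_rank(environ) == 0:
        print(*args, **kwargs)
        return True
    return False


if __name__ == "__main__":
    print_rank0("Starting training")  # appears once per job, not once per rank
```

Calling `print_rank0` in place of `print` from the first statement of the script avoids duplicated output before the distributed framework is set up.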

TensorFlow Multi-GPU not running on A100

  • TensorFlow version: 2.4.1
  • PyTorch version: 1.9.0
  • Python version: 3.8.4
  • GCC/Compiler version: 5.2.0
  • OpenMPI version: 4.1.0
  • CUDA/cuDNN version: 11.0
  • GPU model and memory: A100

Describe the current behavior
Running TensorFlow code on all the GPUs of a node fails with an unexpected error.

Describe the expected behavior
The code should run without an error.

Standalone code to reproduce the issue
https://github.com/pnnl/ProxyTSPRD/blob/master/examples/test_tfhorovod.py
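
The fatal line in the log below reports `ret == 0 (11 vs. 0)` from `pthread_create()`. On Linux, errno 11 is EAGAIN, which `pthread_create` returns when resources are insufficient or a system limit on the number of threads is reached. The report does not establish that a limit is the cause here. The sketch below, assuming Linux and only the standard library, shows one way to inspect the relevant values from inside each rank; the function name `describe_thread_limits` is hypothetical.

```python
import errno
import os
import resource


def describe_thread_limits():
    """Collect values related to thread creation for the current process."""
    # RLIMIT_NPROC caps processes (and, on Linux, threads) for the user.
    # A value of resource.RLIM_INFINITY means no limit is set.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "errno_11_name": errno.errorcode.get(11, "unknown"),
        "nproc_soft": soft,
        "nproc_hard": hard,
        "cpu_count": os.cpu_count(),
    }


if __name__ == "__main__":
    for key, value in describe_thread_limits().items():
        print(f"{key}: {value}")
```

Comparing the soft limit with the number of ranks per node times the threads each rank creates may indicate whether the limit is plausible as a cause.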

Other info / logs

2021-11-19 12:32:23.909021: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
[a100-02:216667] *** Process received signal ***
[a100-02:216667] Signal: Aborted (6)
[a100-02:216667] Signal code:  (-6)
2021-11-19 12:32:23.911684: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-11-19 12:32:23.912414: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2245910000 Hz
2021-11-19 12:32:23.912408: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-11-19 12:32:23.913016: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2245910000 Hz
2021-11-19 12:32:23.922883: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
[a100-02:216666] *** Process received signal ***
[a100-02:216666] Signal: Aborted (6)
[a100-02:216666] Signal code:  (-6)
2021-11-19 12:32:23.922979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37524 MB memory) -> physical GPU (device: 2, name: A100-SXM4-40GB, pci bus id: 0000:47:00.0, compute capability: 8.0)
2021-11-19 12:32:23.923407: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
[a100-02:216662] *** Process received signal ***
[a100-02:216662] Signal: Aborted (6)
[a100-02:216662] Signal code:  (-6)
2021-11-19 12:32:23.923954: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
[a100-02:216660] *** Process received signal ***
[a100-02:216660] Signal: Aborted (6)
[a100-02:216660] Signal code:  (-6)
2021-11-19 12:32:23.943334: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
[a100-02:216665] *** Process received signal ***
[a100-02:216665] Signal: Aborted (6)
[a100-02:216665] Signal code:  (-6)
2021-11-19 12:32:23.952735: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
[a100-02:216663] *** Process received signal ***
[a100-02:216663] Signal: Aborted (6)
[a100-02:216663] Signal code:  (-6)
[a100-02:216662] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2acd96a6b630]
[a100-02:216662] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2acd96cae387]
[a100-02:216662] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2acd96cafa78]
[a100-02:216662] [ 3] [a100-02:216663] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2b7edde6f630]
[a100-02:216663] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7ede0b2387]
[a100-02:216663] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7ede0b3a78]
[a100-02:216663] [ 3] [a100-02:216665] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2b3db30cc630]
[a100-02:216665] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3db330f387]
[a100-02:216665] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3db3310a78]
[a100-02:216665] [ 3] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2b80fb885704]
[a100-02:216663] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2b3fd0ae2704]
[a100-02:216665] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2acfb4481704]
[a100-02:216662] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2b8114372a88]
[a100-02:216663] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2b3fe95cfa88]
[a100-02:216665] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2acfccf6ea88]
[a100-02:216662] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2b8114372b01]
[a100-02:216663] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2b3fe95cfb01]
[a100-02:216665] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2acfccf6eb01]
[a100-02:216662] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEEC1EibS3_+0x72d)[0x2b80f22f69dd]
[a100-02:216663] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEEC1EibS3_+0x72d)[0x2b3fc75539dd]
[a100-02:216665] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEEC1EibS3_+0x72d)[0x2acfaaef29dd]
[a100-02:216662] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKNS_13ThreadOptionsERKSsibPN5Eigen9AllocatorE+0xff)[0x2b80f22f828f]
[a100-02:216663] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKNS_13ThreadOptionsERKSsibPN5Eigen9AllocatorE+0xff)[0x2b3fc755528f]
[a100-02:216665] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKNS_13ThreadOptionsERKSsibPN5Eigen9AllocatorE+0xff)[0x2acfaaef428f]
[a100-02:216662] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKSsi+0x39)[0x2b80f22f8989]
[a100-02:216663] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKSsi+0x39)[0x2b3fc7555989]
[a100-02:216665] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKSsi+0x39)[0x2acfaaef4989]
[a100-02:216662] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN15stream_executor14StreamExecutorC1EPKNS_8PlatformESt10unique_ptrINS_8internal23StreamExecutorInterfaceESt14default_deleteIS6_EEi+0xba)[0x2b80fb57a89a]
[a100-02:216663] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN15stream_executor14StreamExecutorC1EPKNS_8PlatformESt10unique_ptrINS_8internal23StreamExecutorInterfaceESt14default_deleteIS6_EEi+0xba)[0x2b3fd07d789a]
[a100-02:216665] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgrC1EPN15stream_executor14StreamExecutorERKNS_10GPUOptionsE+0x157)[0x2acfb42b97f7]
[a100-02:216662] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudaPlatform19GetUncachedExecutorERKNS_20StreamExecutorConfigE+0x1e5)[0x2b8114858945]
[a100-02:216663] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudaPlatform19GetUncachedExecutorERKNS_20StreamExecutorConfigE+0x1e5)[0x2b3fe9ab5945]
[a100-02:216665] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x163f5ac)[0x2b81148575ac]
[a100-02:216663] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x163f5ac)[0x2b3fe9ab45ac]
[a100-02:216665] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow15EventMgrFactory11GetEventMgrEPN15stream_executor14StreamExecutorERKNS_10GPUOptionsE+0x80)[0x2acfb42b98e0]
[a100-02:216662] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice4InitERKNS_14SessionOptionsE+0x230)[0x2acfccdc9c30]
[a100-02:216662] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN15stream_executor13ExecutorCache11GetOrCreateERKNS_20StreamExecutorConfigERKSt8functionIFNS_4port8StatusOrISt10unique_ptrINS_14StreamExecutorESt14default_deleteIS8_EEEEvEE+0x337)[0x2b80fb5834d7]
[a100-02:216663] [13] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN15stream_executor13ExecutorCache11GetOrCreateERKNS_20StreamExecutorConfigERKSt8functionIFNS_4port8StatusOrISt10unique_ptrINS_14StreamExecutorESt14default_deleteIS8_EEEEvEE+0x337)[0x2b3fd07e04d7]
[a100-02:216665] [13] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudaPlatform11GetExecutorERKNS_20StreamExecutorConfigE+0x4b)[0x2b811485766b]
[a100-02:216663] [14] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory15CreateGPUDeviceERKNS_14SessionOptionsERKSsNS_3gtl7IntTypeINS_12TfGpuId_tag_EiEExRKNS_14DeviceLocalityEPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteISF_EESaISI_EE+0x60b)[0x2acfccdcf20b]
[a100-02:216662] [13] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudaPlatform11GetExecutorERKNS_20StreamExecutorConfigE+0x4b)[0x2b3fe9ab466b]
[a100-02:216665] [14] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudaPlatform17ExecutorForDeviceEi+0x8f)[0x2b8114859aaf]
[a100-02:216663] [15] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory13CreateDevicesERKNS_14SessionOptionsERKSsPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteIS8_EESaISB_EE+0x2ac4)[0x2acfccdd3354]
[a100-02:216662] [14] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudaPlatform17ExecutorForDeviceEi+0x8f)[0x2b3fe9ab6aaf]
[a100-02:216665] [15] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory19GetInterconnectMapsERKSt6vectorINS_3gtl7IntTypeINS_18PlatformGpuId_tag_EiEESaIS5_EEPN15stream_executor8PlatformEPS1_INS0_15InterconnectMapESaISD_EE+0xc4)[0x2b81141cca24]
[a100-02:216663] [16] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13DeviceFactory10AddDevicesERKNS_14SessionOptionsERKSsPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteIS8_EESaISB_EE+0xed)[0x2acfccba0f4d]
[a100-02:216662] [15] [a100-02:216664] [ 0] [a100-02:216667] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2b9cbbfd3630]
[a100-02:216664] [ 1] /usr/lib64/libpthread.so.0(+0xf630)[0x2ab3579bd630]
[a100-02:216667] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b9cbc216387]
[a100-02:216664] [ 2] /usr/lib64/libc.so.6(gsignal+0x37)[0x2ab357c00387]
[a100-02:216667] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b9cbc217a78]
[a100-02:216664] [ 3] /usr/lib64/libc.so.6(abort+0x148)[0x2ab357c01a78]
[a100-02:216667] [ 3] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory19GetInterconnectMapsERKSt6vectorINS_3gtl7IntTypeINS_18PlatformGpuId_tag_EiEESaIS5_EEPN15stream_executor8PlatformEPS1_INS0_15InterconnectMapESaISD_EE+0xc4)[0x2b3fe9429a24]
[a100-02:216665] [16] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory13CreateDevicesERKNS_14SessionOptionsERKSsPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteIS8_EESaISB_EE+0x2d2)[0x2b81141d4b62]
[a100-02:216663] [17] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_NewContext+0x97)[0x2acfaaea7a17]
[a100-02:216662] [16] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x38d25)[0x2acfce3ead25]
[a100-02:216662] [17] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2acfce3e86b2]
[a100-02:216662] [18] python(+0x13c7ae)[0x55dedc5b97ae]
[a100-02:216662] [19] python(_PyObject_MakeTpCall+0x3bf)[0x55dedc5ae25f]
[a100-02:216662] [20] python(_PyEval_EvalFrameDefault+0x5437)[0x55dedc657e87]
[a100-02:216662] [21] python(_PyFunction_Vectorcall+0x1b7)[0x55dedc64a3d7]
[a100-02:216662] [22] python(_PyEval_EvalFrameDefault+0x4bf)[0x55dedc652f0f]
[a100-02:216662] [23] python(_PyFunction_Vectorcall+0x1b7)[0x55dedc64a3d7]
[a100-02:216662] [24] python(_PyEval_EvalFrameDefault+0x4f81)[0x55dedc6579d1]
[a100-02:216662] [25] python(_PyEval_EvalCodeWithName+0xd5f)[0x55dedc649cef]
[a100-02:216662] [26] python(_PyFunction_Vectorcall+0x594)[0x55dedc64a7b4]
[a100-02:216662] [27] python(+0x1b9418)[0x55dedc636418]
[a100-02:216662] [28] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow20BaseGPUDeviceFactory13CreateDevicesERKNS_14SessionOptionsERKSsPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteIS8_EESaISB_EE+0x2d2)[0x2b3fe9431b62]
[a100-02:216665] [17] python(_PyObject_MakeTpCall+0x228)[0x55dedc5ae0c8]
[a100-02:216662] [29] python(_PyEval_EvalFrameDefault+0x4ef0)[0x55dedc657940]
[a100-02:216662] *** End of error message ***
/people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13DeviceFactory10AddDevicesERKNS_14SessionOptionsERKSsPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteIS8_EESaISB_EE+0xed)[0x2b8113fa4f4d]
[a100-02:216663] [18] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2ab5753d3704]
[a100-02:216667] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2b9ed99e9704]
[a100-02:216664] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13DeviceFactory10AddDevicesERKNS_14SessionOptionsERKSsPSt6vectorISt10unique_ptrINS_6DeviceESt14default_deleteIS8_EESaISB_EE+0xed)[0x2b3fe9201f4d]
[a100-02:216665] [18] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2b9ef24d6a88]
[a100-02:216664] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2b9ef24d6b01]
[a100-02:216664] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_NewContext+0x97)[0x2b80f22aba17]
[a100-02:216663] [19] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x38d25)[0x2b81157eed25]
[a100-02:216663] [20] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2b81157ec6b2]
[a100-02:216663] [21] python(+0x13c7ae)[0x55d4e2a387ae]
[a100-02:216663] [22] python(_PyObject_MakeTpCall+0x3bf)[0x55d4e2a2d25f]
[a100-02:216663] [23] python(_PyEval_EvalFrameDefault+0x5437)[0x55d4e2ad6e87]
[a100-02:216663] [24] python(_PyFunction_Vectorcall+0x1b7)[0x55d4e2ac93d7]
[a100-02:216663] [25] python(_PyEval_EvalFrameDefault+0x4bf)[0x55d4e2ad1f0f]
[a100-02:216663] [26] python(_PyFunction_Vectorcall+0x1b7)[0x55d4e2ac93d7]
[a100-02:216663] [27] python(_PyEval_EvalFrameDefault+0x4f81)[0x55d4e2ad69d1]
[a100-02:216663] [28] python(_PyEval_EvalCodeWithName+0xd5f)[0x55d4e2ac8cef]
[a100-02:216663] [29] python(_PyFunction_Vectorcall+0x594)[0x55d4e2ac97b4]
[a100-02:216663] *** End of error message ***
/people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2ab58dec0a88]
[a100-02:216667] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_NewContext+0x97)[0x2b3fc7508a17]
[a100-02:216665] [19] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x38d25)[0x2b3feaa4bd25]
[a100-02:216665] [20] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2b3feaa496b2]
[a100-02:216665] [21] python(+0x13c7ae)[0x562ea73757ae]
[a100-02:216665] [22] python(_PyObject_MakeTpCall+0x3bf)[0x562ea736a25f]
[a100-02:216665] [23] python(_PyEval_EvalFrameDefault+0x5437)[0x562ea7413e87]
[a100-02:216665] [24] python(_PyFunction_Vectorcall+0x1b7)[0x562ea74063d7]
[a100-02:216665] [25] python(_PyEval_EvalFrameDefault+0x4bf)[0x562ea740ef0f]
[a100-02:216665] [26] python(_PyFunction_Vectorcall+0x1b7)[0x562ea74063d7]
[a100-02:216665] [27] python(_PyEval_EvalFrameDefault+0x4f81)[0x562ea74139d1]
[a100-02:216665] [28] python(_PyEval_EvalCodeWithName+0xd5f)[0x562ea7405cef]
[a100-02:216665] [29] python(_PyFunction_Vectorcall+0x594)[0x562ea74067b4]
[a100-02:216665] *** End of error message ***
/people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEEC1EibS3_+0x72d)[0x2b9ed045a9dd]
[a100-02:216664] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2ab58dec0b01]
[a100-02:216667] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKNS_13ThreadOptionsERKSsibPN5Eigen9AllocatorE+0xff)[0x2b9ed045c28f]
[a100-02:216664] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow31NewThreadPoolFromSessionOptionsERKNS_14SessionOptionsE+0xcb)[0x2b9ef2497c3b]
[a100-02:216664] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEEC1EibS3_+0x72d)[0x2ab56be449dd]
[a100-02:216667] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow12EagerContextC2ERKNS_14SessionOptionsENS_28ContextDevicePlacementPolicyEbbPKNS_9DeviceMgrEbPNS_10RendezvousEPNS_33DistributedFunctionLibraryRuntimeE+0x297)[0x2b9ed61bac07]
[a100-02:216664] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow6thread10ThreadPoolC2EPNS_3EnvERKNS_13ThreadOptionsERKSsibPN5Eigen9AllocatorE+0xff)[0x2ab56be4628f]
[a100-02:216667] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow31NewThreadPoolFromSessionOptionsERKNS_14SessionOptionsE+0xcb)[0x2ab58de81c3b]
[a100-02:216667] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_NewContext+0x24e)[0x2b9ed040fbce]
[a100-02:216664] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x38d25)[0x2b9ef3952d25]
[a100-02:216664] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2b9ef39506b2]
[a100-02:216664] [13] python(+0x13c7ae)[0x5617cccb47ae]
[a100-02:216664] [14] python(_PyObject_MakeTpCall+0x3bf)[0x5617ccca925f]
[a100-02:216664] [15] python(_PyEval_EvalFrameDefault+0x5437)[0x5617ccd52e87]
[a100-02:216664] [16] python(_PyFunction_Vectorcall+0x1b7)[0x5617ccd453d7]
[a100-02:216664] [17] python(_PyEval_EvalFrameDefault+0x4bf)[0x5617ccd4df0f]
[a100-02:216664] [18] python(_PyFunction_Vectorcall+0x1b7)[0x5617ccd453d7]
[a100-02:216664] [19] python(_PyEval_EvalFrameDefault+0x4f81)[0x5617ccd529d1]
[a100-02:216664] [20] python(_PyEval_EvalCodeWithName+0xd5f)[0x5617ccd44cef]
[a100-02:216664] [21] python(_PyFunction_Vectorcall+0x594)[0x5617ccd457b4]
[a100-02:216664] [22] python(+0x1b9418)[0x5617ccd31418]
[a100-02:216664] [23] python(_PyObject_MakeTpCall+0x228)[0x5617ccca90c8]
[a100-02:216664] [24] python(_PyEval_EvalFrameDefault+0x4ef0)[0x5617ccd52940]
[a100-02:216664] [25] python(_PyFunction_Vectorcall+0x1b7)[0x5617ccd453d7]
[a100-02:216664] [26] python(+0x1b9318)[0x5617ccd31318]
[a100-02:216664] [27] python(_PyObject_MakeTpCall+0x228)[0x5617ccca90c8]
[a100-02:216664] [28] python(_PyEval_EvalFrameDefault+0x4ef0)[0x5617ccd52940]
[a100-02:216664] [29] python(_PyEval_EvalCodeWithName+0x260)[0x5617ccd441f0]
[a100-02:216664] *** End of error message ***
/people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow12EagerContextC2ERKNS_14SessionOptionsENS_28ContextDevicePlacementPolicyEbbPKNS_9DeviceMgrEbPNS_10RendezvousEPNS_33DistributedFunctionLibraryRuntimeE+0x297)[0x2ab571ba4c07]
[a100-02:216667] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_NewContext+0x24e)[0x2ab56bdf9bce]
[a100-02:216667] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x38d25)[0x2ab58f33cd25]
[a100-02:216667] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2ab58f33a6b2]
[a100-02:216667] [13] python(+0x13c7ae)[0x5653fa3c57ae]
[a100-02:216667] [14] python(_PyObject_MakeTpCall+0x3bf)[0x5653fa3ba25f]
[a100-02:216667] [15] python(_PyEval_EvalFrameDefault+0x5437)[0x5653fa463e87]
[a100-02:216667] [16] python(_PyFunction_Vectorcall+0x1b7)[0x5653fa4563d7]
[a100-02:216667] [17] python(_PyEval_EvalFrameDefault+0x4bf)[0x5653fa45ef0f]
[a100-02:216667] [18] python(_PyFunction_Vectorcall+0x1b7)[0x5653fa4563d7]
[a100-02:216667] [19] python(_PyEval_EvalFrameDefault+0x4f81)[0x5653fa4639d1]
[a100-02:216667] [20] python(_PyEval_EvalCodeWithName+0xd5f)[0x5653fa455cef]
[a100-02:216667] [21] python(_PyFunction_Vectorcall+0x594)[0x5653fa4567b4]
[a100-02:216667] [22] python(+0x1b9418)[0x5653fa442418]
[a100-02:216667] [23] python(_PyObject_MakeTpCall+0x228)[0x5653fa3ba0c8]
[a100-02:216667] [24] python(_PyEval_EvalFrameDefault+0x4ef0)[0x5653fa463940]
[a100-02:216667] [25] python(_PyFunction_Vectorcall+0x1b7)[0x5653fa4563d7]
[a100-02:216667] [26] python(+0x1b9318)[0x5653fa442318]
[a100-02:216667] [27] python(_PyObject_MakeTpCall+0x228)[0x5653fa3ba0c8]
[a100-02:216667] [28] python(_PyEval_EvalFrameDefault+0x4ef0)[0x5653fa463940]
[a100-02:216667] [29] python(_PyEval_EvalCodeWithName+0x260)[0x5653fa4551f0]
[a100-02:216667] *** End of error message ***
[a100-02:216666] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2b4a269ca630]
[a100-02:216666] [ 1] [a100-02:216660] [ 0] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b4a26c0d387]
[a100-02:216666] [ 2] /usr/lib64/libpthread.so.0(+0xf630)[0x2b0b8e4f1630]
[a100-02:216660] [ 1] /usr/lib64/libc.so.6(abort+0x148)[0x2b4a26c0ea78]
[a100-02:216666] [ 3] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b0b8e734387]
[a100-02:216660] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b0b8e735a78]
[a100-02:216660] [ 3] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2b4c443e0704]
[a100-02:216666] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xc89a704)[0x2b0dabf07704]
[a100-02:216660] [ 4] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2b4c5cecda88]
[a100-02:216666] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2b4c5cecdb01]
[a100-02:216666] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow18AsyncSingletonImpl25StartInitializationThreadEPNS_24LoggerSingletonContainerE+0x32a)[0x2b4c5cebdf2a]
[a100-02:216666] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN4absl14lts_2020_02_2513base_internal12CallOnceImplIRFvPN10tensorflow24LoggerSingletonContainerEEJRS5_EEEvPSt6atomicIjENS1_14SchedulingModeEOT_DpOT0_+0x2a)[0x2b4c5cebe2ba]
[a100-02:216666] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow6Logger17GetSingletonAsyncEv+0x60)[0x2b4c5cebe370]
[a100-02:216666] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115aa88)[0x2b0dc49f4a88]
[a100-02:216660] [ 5] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(+0x115ab01)[0x2b0dc49f4b01]
[a100-02:216660] [ 6] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow18AsyncSingletonImpl25StartInitializationThreadEPNS_24LoggerSingletonContainerE+0x32a)[0x2b0dc49e4f2a]
[a100-02:216660] [ 7] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN4absl14lts_2020_02_2513base_internal12CallOnceImplIRFvPN10tensorflow24LoggerSingletonContainerEEJRS5_EEEvPSt6atomicIjENS1_14SchedulingModeEOT_DpOT0_+0x2a)[0x2b0dc49e52ba]
[a100-02:216660] [ 8] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow6Logger17GetSingletonAsyncEv+0x60)[0x2b0dc49e5370]
[a100-02:216660] [ 9] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x9085a7f)[0x2b4c40bcba7f]
[a100-02:216666] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x9085a7f)[0x2b0da86f2a7f]
[a100-02:216660] [10] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow20BroadcastXlaActivityENS_25XlaAutoClusteringActivityE+0x57)[0x2b4c41734587]
[a100-02:216666] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow20BroadcastXlaActivityENS_25XlaAutoClusteringActivityE+0x57)[0x2b0da925b587]
[a100-02:216660] [11] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow24ReportClusteringInfoPass3RunERKNS_28GraphOptimizationPassOptionsE+0x93)[0x2b4c40c57a53]
[a100-02:216666] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow24OptimizationPassRegistry11RunGroupingENS0_8GroupingERKNS_28GraphOptimizationPassOptionsE+0x1a6)[0x2b4c5cdb2b06]
[a100-02:216666] [13] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow29ProcessFunctionLibraryRuntime22InstantiateMultiDeviceERKSsNS_9AttrSliceERKNS_22FunctionLibraryRuntime18InstantiateOptionsEPy+0x13c5)[0x2b4c5cd938c5]
[a100-02:216666] [14] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow29ProcessFunctionLibraryRuntime11InstantiateERKSsNS_9AttrSliceERKNS_22FunctionLibraryRuntime18InstantiateOptionsEPy+0xc3)[0x2b4c5cd96053]
[a100-02:216666] [15] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow19KernelAndDeviceFunc15InstantiateFuncERKNS_15KernelAndDevice7ContextERKNS_7NodeDefEPNS_14GraphCollectorE+0x1042)[0x2b4c40bc0702]
[a100-02:216666] [16] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow24ReportClusteringInfoPass3RunERKNS_28GraphOptimizationPassOptionsE+0x93)[0x2b0da877ea53]
[a100-02:216660] [12] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow24OptimizationPassRegistry11RunGroupingENS0_8GroupingERKNS_28GraphOptimizationPassOptionsE+0x1a6)[0x2b0dc48d9b06]
[a100-02:216660] [13] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow29ProcessFunctionLibraryRuntime22InstantiateMultiDeviceERKSsNS_9AttrSliceERKNS_22FunctionLibraryRuntime18InstantiateOptionsEPy+0x13c5)[0x2b0dc48ba8c5]
[a100-02:216660] [14] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow19KernelAndDeviceFunc4InitERKNS_15KernelAndDevice7ContextERKNS_7NodeDefEPNS_14GraphCollectorE+0x1f)[0x2b4c40bc165f]
[a100-02:216666] [17] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x902eaf1)[0x2b4c40b74af1]
[a100-02:216666] [18] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow29ProcessFunctionLibraryRuntime11InstantiateERKSsNS_9AttrSliceERKNS_22FunctionLibraryRuntime18InstantiateOptionsEPy+0xc3)[0x2b0dc48bd053]
[a100-02:216660] [15] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x902fc95)[0x2b4c40b75c95]
[a100-02:216666] [19] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow19KernelAndDeviceFunc15InstantiateFuncERKNS_15KernelAndDevice7ContextERKNS_7NodeDefEPNS_14GraphCollectorE+0x1042)[0x2b0da86e7702]
[a100-02:216660] [16] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow12EagerExecuteEPNS_14EagerOperationEPPNS_12TensorHandleEPi+0x180)[0x2b4c40b775c0]
[a100-02:216666] [20] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow19KernelAndDeviceFunc4InitERKNS_15KernelAndDevice7ContextERKNS_7NodeDefEPNS_14GraphCollectorE+0x1f)[0x2b0da86e865f]
[a100-02:216660] [17] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow14EagerOperation7ExecuteEN4absl14lts_2020_02_254SpanIPNS_20AbstractTensorHandleEEEPi+0x18c)[0x2b4c40b61dbc]
[a100-02:216666] [21] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x902eaf1)[0x2b0da869baf1]
[a100-02:216660] [18] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_Execute+0x26)[0x2b4c3ae07a56]
[a100-02:216666] [22] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_Z24TFE_Py_ExecuteCancelableP11TFE_ContextPKcS2_PN4absl14lts_2020_02_2513InlinedVectorIP16TFE_TensorHandleLm4ESaIS7_EEEP7_objectP23TFE_CancellationManagerPNS5_IS7_Lm2ES8_EEP9TF_Status+0x4b5)[0x2b4c3ad811a5]
[a100-02:216666] [23] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x2f557)[0x2b4c5e340557]
[a100-02:216666] [24] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x3101b)[0x2b4c5e34201b]
[a100-02:216666] [25] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2b4c5e3476b2]
[a100-02:216666] [26] python(+0x13c7ae)[0x558ff4bcf7ae]
[a100-02:216666] [27] python(_PyObject_MakeTpCall+0x3bf)[0x558ff4bc425f]
[a100-02:216666] [28] python(_PyEval_EvalFrameDefault+0x5437)[0x558ff4c6de87]
[a100-02:216666] [29] python(_PyEval_EvalCodeWithName+0x260)[0x558ff4c5f1f0]
[a100-02:216666] *** End of error message ***
/people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x902fc95)[0x2b0da869cc95]
[a100-02:216660] [19] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow12EagerExecuteEPNS_14EagerOperationEPPNS_12TensorHandleEPi+0x180)[0x2b0da869e5c0]
[a100-02:216660] [20] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow14EagerOperation7ExecuteEN4absl14lts_2020_02_254SpanIPNS_20AbstractTensorHandleEEEPi+0x18c)[0x2b0da8688dbc]
[a100-02:216660] [21] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TFE_Execute+0x26)[0x2b0da292ea56]
[a100-02:216660] [22] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_Z24TFE_Py_ExecuteCancelableP11TFE_ContextPKcS2_PN4absl14lts_2020_02_2513InlinedVectorIP16TFE_TensorHandleLm4ESaIS7_EEEP7_objectP23TFE_CancellationManagerPNS5_IS7_Lm2ES8_EEP9TF_Status+0x4b5)[0x2b0da28a81a5]
[a100-02:216660] [23] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x2f557)[0x2b0dc5e67557]
[a100-02:216660] [24] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x3101b)[0x2b0dc5e6901b]
[a100-02:216660] [25] /people/jain432/.conda/envs/horovod/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so(+0x366b2)[0x2b0dc5e6e6b2]
[a100-02:216660] [26] python(+0x13c7ae)[0x55674eb8b7ae]
[a100-02:216660] [27] python(_PyObject_MakeTpCall+0x3bf)[0x55674eb8025f]
[a100-02:216660] [28] python(_PyEval_EvalFrameDefault+0x5437)[0x55674ec29e87]
[a100-02:216660] [29] python(_PyEval_EvalCodeWithName+0x260)[0x55674ec1b1f0]
[a100-02:216660] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 216667 on node a100-02 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
