python3.6 cuda9 cudnn7 tensorflow:1.10.0 `(tf-1.10-cp3) [zgz@localho

Segmentation fault about pocketflow HOT 7 CLOSED

tencent commented on May 16, 2024

Segmentation fault

from pocketflow.

Comments (7)

jiaxiang-wu commented on May 16, 2024

It seems that you are using the Python version of CIFAR-10 data set.

INFO:tensorflow:data_dir_local: /home/zgz/project/data_set/cifar-10-batches-py

Please use the binary version instead. Same issue as this one.
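For anyone unsure which CIFAR-10 release a directory contains, here is a minimal sketch of a check. It is not part of PocketFlow; the helper name `detect_cifar10_format` is hypothetical. It relies only on the file naming of the official archives: the binary release ships `data_batch_1.bin` and so on, while the Python release ships pickled files named `data_batch_1` with no extension.

```python
import os


def detect_cifar10_format(data_dir):
    """Guess which CIFAR-10 release data_dir holds (hypothetical helper).

    Returns 'binary', 'python', or 'unknown', based only on file names.
    """
    names = os.listdir(data_dir)
    # Binary release: data_batch_1.bin ... data_batch_5.bin, test_batch.bin
    if any(n.startswith('data_batch_') and n.endswith('.bin') for n in names):
        return 'binary'
    # Python release: pickled files data_batch_1 ... data_batch_5 (no extension)
    if any(n.startswith('data_batch_') and '.' not in n for n in names):
        return 'python'
    return 'unknown'
```

Calling it on the `data_dir_local` path from the log above would be expected to report `'python'`, since that path ends in `cifar-10-batches-py`.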

as754770178 commented on May 16, 2024

Traceback (most recent call last):
  File "utils/get_idle_gpus.py", line 54, in <module>
    raise ValueError('not enough idle GPUs; idle GPUs are: {}'.format(idle_gpus))
ValueError: not enough idle GPUs; idle GPUs are: []
‘nets/resnet_at_cifar10_run.py’ -> ‘main.py’
and

2018-11-06 14:54:44.429596: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
Why are these two errors reported?

jiaxiang-wu commented on May 16, 2024

It seems PocketFlow failed to find an idle GPU device. Can you post the result of nvidia-smi?

$ nvidia-smi

as754770178 commented on May 16, 2024

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     17631      C   ...gz/anaconda2/envs/tf-1.8-cp3/bin/python    99MiB |
|    1     17631      C   ...gz/anaconda2/envs/tf-1.8-cp3/bin/python    99MiB |
|    2     17631      C   ...gz/anaconda2/envs/tf-1.8-cp3/bin/python    99MiB |
|    3     17631      C   ...gz/anaconda2/envs/tf-1.8-cp3/bin/python    99MiB |
+-----------------------------------------------------------------------------+`

`>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())
2018-11-06 12:46:55.594800: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-06 12:46:55.832609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:06:00.0
totalMemory: 11.17GiB freeMemory: 11.00GiB
2018-11-06 12:46:56.019314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:07:00.0
totalMemory: 11.17GiB freeMemory: 11.00GiB
2018-11-06 12:46:56.188525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:85:00.0
totalMemory: 11.17GiB freeMemory: 11.00GiB
2018-11-06 12:46:56.388330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:86:00.0
totalMemory: 11.17GiB freeMemory: 11.00GiB
2018-11-06 12:46:56.388891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2018-11-06 12:46:57.834106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-06 12:46:57.834160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2018-11-06 12:46:57.834170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y N N
2018-11-06 12:46:57.834178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N N N
2018-11-06 12:46:57.834184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: N N N Y
2018-11-06 12:46:57.834190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: N N Y N
2018-11-06 12:46:57.835321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:0 with 10662 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:06:00.0, compute capability: 3.7)
2018-11-06 12:46:57.955038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:1 with 10662 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:07:00.0, compute capability: 3.7)
2018-11-06 12:46:58.078889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:2 with 10662 MB memory) -> physical GPU (device: 2, name: Tesla K80, pci bus id: 0000:85:00.0, compute capability: 3.7)
2018-11-06 12:46:58.196860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:3 with 10662 MB memory) -> physical GPU (device: 3, name: Tesla K80, pci bus id: 0000:86:00.0, compute capability: 3.7)
`

jiaxiang-wu commented on May 16, 2024

In utils/get_idle_gpus.py, a GPU device is treated as idle if there is no process running on it. According to your nvidia-smi output, each of these four GPUs has a process running on it (PID 17631), so utils/get_idle_gpus.py cannot find an idle one.

To temporarily override this, you may skip calling utils/get_idle_gpus.py and manually specify an idle GPU in scripts/run_local.sh.
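To illustrate the idle-GPU rule described above, here is a rough sketch. It is not the actual code in utils/get_idle_gpus.py; the function name `find_idle_gpus` and its parameters are hypothetical, and the parsing assumes the older nvidia-smi process-table layout shown in this thread (GPU, PID, Type, Process name, Usage).

```python
import re


def find_idle_gpus(nvidia_smi_output, nb_gpus):
    """Return indices of GPUs that have no row in the 'Processes' table.

    Hypothetical helper; assumes rows look like
    '|  <gpu>  <pid>  <type>  <process name>  <usage> |'.
    """
    busy = set()
    in_processes = False
    for line in nvidia_smi_output.splitlines():
        if 'Processes:' in line:
            in_processes = True  # process rows only appear after this header
            continue
        if not in_processes:
            continue
        match = re.match(r'\|\s+(\d+)\s+(\d+)\s+[A-Z+]+\s+\S', line)
        if match:
            busy.add(int(match.group(1)))  # first number is the GPU index
    return [idx for idx in range(nb_gpus) if idx not in busy]


# Process table as posted in this thread: every GPU has a process on it.
sample = """\
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
| 0 17631 C ...gz/anaconda2/envs/tf-1.8-cp3/bin/python 99MiB |
| 1 17631 C ...gz/anaconda2/envs/tf-1.8-cp3/bin/python 99MiB |
| 2 17631 C ...gz/anaconda2/envs/tf-1.8-cp3/bin/python 99MiB |
| 3 17631 C ...gz/anaconda2/envs/tf-1.8-cp3/bin/python 99MiB |
"""
print(find_idle_gpus(sample, nb_gpus=4))  # prints []
```

Under this rule even a 99MiB process marks a GPU as busy, which is why the list comes back empty here. For the manual override, `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable for restricting which devices a process can see; whether scripts/run_local.sh passes the GPU list through it is an assumption to verify against the script itself.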

as754770178 commented on May 16, 2024

OK, thanks

jiaxiang-wu commented on May 16, 2024

Closing this issue. Reopen it if you have any further questions.
