Comments (8)
So in this case, the NVIDIA driver cannot be found (although it exists?) for some reason: gpustat says "Error on querying NVIDIA devices. Use --debug flag for details" and nvidia-smi reports an error as well. I think both programs are working as expected (raising and printing errors). How do you expect gpustat to behave in this scenario?
from gpustat.
This depends on how you view the purpose of the program. Since in practice the result is that the user cannot use a graphics card, I would personally expect gpustat to behave as if it could not find a graphics card rather than an error. That way a consumer of gpustat like ray would not need to use an error catching mechanism to differentiate between "no available graphics card for the user" and "some internal problem with gpustat that is unexpected".
from gpustat.
I see; yes, we can definitely improve the error message depending on the type of error being thrown from nvmlInit().
from gpustat.
With 175c34b (in 1.1.dev), I think gpustat will print the exception messages being thrown. For instance,
Error on querying NVIDIA devices. Use --debug flag to see more details.
Insufficient Permissions
Error on querying NVIDIA devices. Use --debug flag to see more details.
Driver Not Loaded
@mattip Would that suffice for you?
from gpustat.
@wookayin, I think @mattip's usage is more about Python integration than the command-line tool. @mattip may want an extra flag for GPUStatCollection.new_query to return an empty collection if any NVML-related error is raised.
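Until such a flag exists, a consumer could get the same effect with a small wrapper on its own side. The following is only a sketch: `query_or_empty` is a hypothetical helper, not part of gpustat, and it assumes the object returned by `gpustat.new_query()` can be iterated over to obtain the per-GPU entries.

```python
from typing import Callable, Iterable, Tuple, Type, Union

ErrorTypes = Union[Type[BaseException], Tuple[Type[BaseException], ...]]


def query_or_empty(new_query: Callable[[], Iterable], nvml_error: ErrorTypes) -> list:
    """Run a GPU query; return [] if it raises one of the given NVML errors.

    Hypothetical helper, not part of gpustat. Any other exception still
    propagates, so unexpected internal failures are not silently hidden.
    """
    try:
        # Assumption: the query result is iterable (one entry per GPU).
        return list(new_query())
    except nvml_error:
        return []


# Possible usage, assuming gpustat and pynvml are installed:
#
#     import gpustat
#     import pynvml
#     gpus = query_or_empty(gpustat.new_query, pynvml.NVMLError)
#     print(f"{len(gpus)} GPU(s) visible")
```

Passing the exception type explicitly keeps the wrapper narrow: only errors the caller names are treated as "no GPU available".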
from gpustat.
"extra flag for GPUStatCollection.new_query to return an empty collection"
+1 that would be great if possible. 175c34b is already a nice improvement.
from gpustat.
I don't think gpustat.new_query() should return an empty collection when an error occurs on the NVML side. This is a clear case of an exception or error. Although it is not explicitly documented and we don't translate exceptions, NVMLError (or one of its subclasses) will be raised when an NVML-related error occurs while querying NVIDIA devices.
from gpustat.
"I don't think gpustat.new_query() should return an empty collection when error happens from the NVML side."
Then, how about new test functions such as is_available() and device_count() (or gpu_count())? They should return False or 0 if an NVML-related error occurs, rather than raise. Test functions that fail silently reduce the number of try-except blocks and improve code readability. I also added similar functions in my nvitop: nvitop.Device.is_available(). It would be great to have something similar in gpustat.
Most mainstream ML frameworks provide those test functions for users so that they can test the GPU availability in the first place. For example:
❯ export CUDA_VISIBLE_DEVICES=''
❯ ipython3
Python 3.10.8 (main, Oct 11 2022, 11:35:05) [GCC 11.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.7.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import torch
In [2]: torch.cuda.is_available()
Out[2]: False
In [3]: torch.cuda.device_count()
Out[3]: 0
In [4]: import tensorflow as tf
2022-12-03 05:10:48.804143: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-03 05:10:49.453942: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/PanXuehai/.mujoco/mujoco210/bin:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/home/PanXuehai/.local/lib
2022-12-03 05:10:49.454000: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/PanXuehai/.mujoco/mujoco210/bin:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/home/PanXuehai/.local/lib
2022-12-03 05:10:49.454008: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
In [5]: tf.config.list_physical_devices('GPU')
2022-12-03 05:10:57.811047: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-12-03 05:10:57.811074: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: Precision-3630-Tower
2022-12-03 05:10:57.811081: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: Precision-3630-Tower
2022-12-03 05:10:57.811112: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 520.56.6
2022-12-03 05:10:57.811133: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 520.56.6
2022-12-03 05:10:57.811139: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 520.56.6
Out[5]: []
In [6]: from nvitop import Device
In [7]: Device.all()
Out[7]: [PhysicalDevice(index=0, name="NVIDIA GeForce RTX 3060", total_memory=12288MiB)]
In [8]: Device.count()
Out[8]: 1
In [9]: Device.is_available()
Out[9]: True
In [10]: Device.cuda.all()
Out[10]: []
In [11]: Device.cuda.count()
Out[11]: 0
In [12]: Device.cuda.is_available()
Out[12]: False
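For gpustat, test functions of the kind proposed above might look roughly like the sketch below. It calls pynvml directly; `gpu_count` and `is_available` are hypothetical names taken from the proposal, not existing gpustat functions, and a missing pynvml package is treated the same as "no GPU" (an assumption of this sketch).

```python
def gpu_count() -> int:
    """Return the number of NVML-visible GPUs, or 0 on any NVML error.

    Hypothetical function: a sketch of the proposal, not gpustat's API.
    """
    try:
        import pynvml
    except ImportError:
        # Assumption: without NVML bindings, report no GPUs.
        return 0
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # e.g. "Driver Not Loaded" or "Insufficient Permissions"
        return 0
    try:
        return pynvml.nvmlDeviceGetCount()
    except pynvml.NVMLError:
        return 0
    finally:
        # Release the NVML handle acquired by nvmlInit() above.
        try:
            pynvml.nvmlShutdown()
        except pynvml.NVMLError:
            pass


def is_available() -> bool:
    """Return True if at least one GPU can be queried through NVML."""
    return gpu_count() > 0
```

Note that this counts devices visible to NVML, which ignores CUDA_VISIBLE_DEVICES, as the difference between Device.count() and Device.cuda.count() in the session above shows.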
from gpustat.