Comments (20)
This might have been an issue with the docker image.
Can you try with the latest one? ghcr.io/huggingface/text-generation-inference:sha-6837b2e
from text-generation-inference.
yeah one sec
No dice, @OlivierDehaene. It seems something has changed, since I was able to get past this point yesterday.
I was able to run the following command on an A10-powered VM:
docker run --gpus "device=0" -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:sha-6837b2e --model-id decapoda-research/llama-7b-hf --quantize
Are you sure you don't have another issue with your setup?
I'm not sure. I still just see:
{"timestamp":"2023-04-19T20:17:40.488749Z","level":"ERROR","fields":{"message":"Shard 0 failed to start:\n/opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 58, in serve\n server.serve(model_id, revision, sharded, quantize, uds_path)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 135, in serve\n asyncio.run(serve_inner(model_id, revision, sharded, quantize))\n\n File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 647, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 104, in serve_inner\n model = get_model(model_id, revision, sharded, quantize)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 133, in get_model\n return llama_cls(model_id, revision, quantize=quantize)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py\", line 306, in __init__\n raise ValueError(\"quantization is not available on CPU\")\n\nValueError: quantization is not available on CPU\n\n"},"target":"text_generation_launcher"}
However, I'm using bitsandbytes in a different pod on the same node and AMI.
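The launcher log above is a single JSON record with the Python traceback escaped inside it, which makes the actual error hard to spot. A minimal sketch for unwrapping such a line follows, assuming the `fields.message` layout seen in the pasted logs; the helper name `extract_message` and the shortened sample line are made up for illustration.

```python
import json


def extract_message(log_line: str) -> str:
    """Return the human-readable message from one JSON log line."""
    record = json.loads(log_line)
    # In the logs pasted in this thread the text lives under fields.message,
    # with the traceback newlines escaped as \n inside the JSON string.
    return record.get("fields", {}).get("message", "")


# Shortened, hand-written sample in the same shape as the log line above.
sample = (
    '{"timestamp":"2023-04-19T20:17:40Z","level":"ERROR",'
    '"fields":{"message":"Shard 0 failed to start:\\n'
    'ValueError: quantization is not available on CPU\\n"},'
    '"target":"text_generation_launcher"}'
)

if __name__ == "__main__":
    print(extract_message(sample))
```

The last non-empty line of the unwrapped message is usually the exception itself, which is the part worth quoting in a bug report.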
Do you have any idea what else could be horribly wrong here?
I went back to 0.5.0 and am seeing:
{"timestamp":"2023-04-19T20:30:38.758543Z","level":"INFO","fields":{"message":"Args { model_id: \"decapoda-research/llama-7b-hf\", revision: None, sharded: None, num_shard: Some(1), quantize: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: 32, max_waiting_tokens: 20, port: 6018, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None }"},"target":"text_generation_launcher"}
{"timestamp":"2023-04-19T20:30:38.758781Z","level":"INFO","fields":{"message":"Starting shard 0"},"target":"text_generation_launcher"}
{"timestamp":"2023-04-19T20:30:40.794214Z","level":"ERROR","fields":{"message":"\"Error when initializing model\nTraceback (most recent call last):\n File \\\"/opt/miniconda/envs/text-generation/bin/text-generation-server\\\", line 8, in <module>\n sys.exit(app())\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\\\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\\\", line 1130, in __call__\n return self.main(*args, **kwargs)\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\\\", line 778, in main\n return _main(\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\\\", line 216, in _main\n rv = self.invoke(ctx)\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\\\", line 1657, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\\\", line 1404, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\\\", line 760, in invoke\n return __callback(*args, **kwargs)\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\\\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/cli.py\\\", line 55, in serve\n server.serve(model_id, revision, sharded, quantize, uds_path)\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\\\", line 135, in serve\n asyncio.run(serve_inner(model_id, revision, sharded, quantize))\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/runners.py\\\", line 44, in run\n return 
loop.run_until_complete(main)\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\\\", line 634, in run_until_complete\n self.run_forever()\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\\\", line 601, in run_forever\n self._run_once()\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\\\", line 1905, in _run_once\n handle._run()\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/events.py\\\", line 80, in _run\n self._context.run(self._callback, *self._args)\n> File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\\\", line 104, in serve_inner\n model = get_model(model_id, revision, sharded, quantize)\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/__init__.py\\\", line 112, in get_model\n return llama_cls(model_id, revision, quantize=quantize)\n File \\\"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py\\\", line 39, in __init__\n raise NotImplementedError(\\\"FlashLlama does not support quantization\\\")\nNotImplementedError: FlashLlama does not support quantization\n\""},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Which means that we got past the point where the failure above happens, @OlivierDehaene.
Let me know of anything I could check to help here!
The installed version of bitsandbytes was compiled without GPU support
Looks like a CUDA issue based on that error.
@darth-veitcher I agree, but when I check the pod it has a GPU allocated to it by K8s, and I am running bitsandbytes in a different pod on the same node just fine.
I put more details in issue #197.
Can you override the entrypoint with /bin/bash and args with sleep 10000, shell into the pod and:
- run nvidia-smi
- run python:
import torch
print(torch.cuda.is_available())
import bitsandbytes
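The checks requested above can also be gathered in one pass, which is handy when comparing two pods. This is only a sketch: the function name `gpu_report` and the report keys are hypothetical, and it deliberately swallows import errors so that it still runs in a container where torch or bitsandbytes is missing.

```python
import importlib
import shutil
import subprocess


def gpu_report() -> dict:
    """Collect the three signals requested above without raising."""
    report = {}

    # nvidia-smi is typically mounted into the container by the NVIDIA runtime.
    smi = shutil.which("nvidia-smi")
    report["nvidia_smi_found"] = smi is not None
    if smi is not None:
        try:
            result = subprocess.run(
                [smi], capture_output=True, text=True, timeout=30
            )
            report["nvidia_smi_ok"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            report["nvidia_smi_ok"] = False

    # None means torch could not be imported in this environment.
    try:
        torch = importlib.import_module("torch")
        report["torch_cuda_available"] = bool(torch.cuda.is_available())
    except Exception:
        report["torch_cuda_available"] = None

    # Importing bitsandbytes prints its own CUDA SETUP banner as a side effect.
    try:
        importlib.import_module("bitsandbytes")
        report["bitsandbytes_imported"] = True
    except Exception:
        report["bitsandbytes_imported"] = False

    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

Running the same script in the failing pod and in a working one gives two small reports that are easy to diff.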
@OlivierDehaene as requested
# nvidia-smi
Fri Apr 21 14:53:21 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 26C P8 16W / 300W | 0MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
# python
Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
False
>>> import bitsandbytes
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
CUDA SETUP: Loading binary /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
So it seems that the CUDA version is N/A? I have never seen that before.
On my other pod I ran the same command and see:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 32C P0 58W / 300W | 19098MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
This pod is using a base image of nvcr.io/nvidia/pytorch:22.12-py3 and an AMI of ami-021248e8228eba010, which are both pretty basic...
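The two nvidia-smi outputs above report the same driver (470.161.03) and differ only in the "CUDA Version" field. A small sketch for pulling those fields out of the header so the two pods can be compared programmatically is shown here. It assumes the header layout pasted in this thread; other nvidia-smi releases may format the line differently, and `parse_header` is a hypothetical helper name.

```python
import re

# Matches the banner line of the nvidia-smi table as pasted above.
HEADER = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>\S+)\s+"
    r"Driver Version:\s+(?P<driver>\S+)\s+"
    r"CUDA Version:\s+(?P<cuda>\S+)"
)


def parse_header(text: str):
    """Return the smi, driver and cuda fields, or None if no header is found."""
    match = HEADER.search(text)
    return match.groupdict() if match else None


failing = "| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: N/A |"
working = "| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.8 |"

if __name__ == "__main__":
    print(parse_header(failing))
    print(parse_header(working))
```

A result where the driver matches but the CUDA field differs points at what is visible inside each container rather than at the host driver itself.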
It is likely a problem with how it resolves the linked CUDA libraries: it can't find them. The output of python -m bitsandbytes usually contains some useful information, but I imagine the use of Conda complicates things.
I would try running the above command to get the more verbose error messages it generates.
@darth-veitcher as requested
# python -m bitsandbytes
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
CUDA SETUP: Loading binary /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++ ANACONDA CUDA PATHS ++++++++++++++++++++
/opt/conda/lib/libicudata.so
/opt/conda/lib/libomptarget.rtl.cuda.nextgen.so
/opt/conda/lib/libomptarget.rtl.cuda.so
/opt/conda/lib/python3.9/site-packages/torch/lib/libc10_cuda.so
/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so
/opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cuda_linalg.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda102_nocublaslt.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda110.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda110_nocublaslt.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda111.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda111_nocublaslt.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda112.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda114.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda114_nocublaslt.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda115.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda115_nocublaslt.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116_nocublaslt.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda120.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda120_nocublaslt.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
/opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda121_nocublaslt.so
/opt/conda/lib/python3.9/site-packages/flash_attn_cuda.cpython-39-x86_64-linux-gnu.so
/opt/conda/pkgs/icu-72.1-hcb278e6_0/lib/libicudata.so
/opt/conda/pkgs/llvm-openmp-16.0.1-h417c0b6_0/lib/libomptarget.rtl.cuda.nextgen.so
/opt/conda/pkgs/llvm-openmp-16.0.1-h417c0b6_0/lib/libomptarget.rtl.cuda.so
/opt/conda/pkgs/pytorch-2.0.0-py3.9_cuda11.8_cudnn8.7.0_0/lib/python3.9/site-packages/torch/lib/libc10_cuda.so
/opt/conda/pkgs/pytorch-2.0.0-py3.9_cuda11.8_cudnn8.7.0_0/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so
/opt/conda/pkgs/pytorch-2.0.0-py3.9_cuda11.8_cudnn8.7.0_0/lib/python3.9/site-packages/torch/lib/libtorch_cuda_linalg.so
++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++
+++++++++++++++ WORKING DIRECTORY CUDA PATHS +++++++++++++++
/usr/src/transformers/build/lib.linux-x86_64-cpython-39/transformers/models/bloom/custom_kernels/fused_bloom_attention_cuda.cpython-39-x86_64-linux-gnu.so
/usr/src/transformers/build/lib.linux-x86_64-cpython-39/transformers/models/gpt_neox/custom_kernels/fused_attention_cuda.cpython-39-x86_64-linux-gnu.so
/usr/src/transformers/src/transformers/models/bloom/custom_kernels/fused_bloom_attention_cuda.cpython-39-x86_64-linux-gnu.so
/usr/src/transformers/src/transformers/models/gpt_neox/custom_kernels/fused_attention_cuda.cpython-39-x86_64-linux-gnu.so
++++++++++++++++++ LD_LIBRARY CUDA PATHS +++++++++++++++++++
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.9/site-packages/bitsandbytes/__main__.py", line 95, in <module>
generate_bug_report_information()
File "/opt/conda/lib/python3.9/site-packages/bitsandbytes/__main__.py", line 66, in generate_bug_report_information
lib_path = os.environ['LD_LIBRARY_PATH'].strip()
File "/opt/conda/lib/python3.9/os.py", line 679, in __getitem__
raise KeyError(key) from None
KeyError: 'LD_LIBRARY_PATH'
Is this running on a Windows host by any chance? I came across a similar edge case the other day when running containers on WSL. I needed to add the following command.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib
Your solution will likely need to be quite similar given the last segment of the error message: it can't find this environment variable.
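For context, the traceback ends in os.environ['LD_LIBRARY_PATH'].strip(), which raises KeyError whenever the variable is unset. A sketch of a tolerant lookup follows. This is not the bitsandbytes code, and `library_search_dirs` is a hypothetical helper; it only illustrates why the diagnostic crashed instead of printing an empty section.

```python
import os


def library_search_dirs(env=None):
    """Return the LD_LIBRARY_PATH entries, or an empty list if it is unset."""
    env = os.environ if env is None else env
    # .get() with a default avoids the KeyError that indexing raises
    # when the variable is missing from the environment.
    raw = env.get("LD_LIBRARY_PATH", "")
    # LD_LIBRARY_PATH is a colon-separated list on Linux; skip empty entries.
    return [entry for entry in raw.strip().split(":") if entry]


if __name__ == "__main__":
    print(library_search_dirs())
```

An empty result here only means the variable is unset, which by itself does not prove that the CUDA libraries are unreachable.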
It is a Linux machine. I linked the AMI above. It's a standard Amazon Deep Learning machine image so I imagine this will be a common issue.
From what you reported it seems that you have a configuration issue with your node (torch does not detect the GPU):
Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
False <- torch does not seem to detect the GPU
Your driver seems really old. Maybe it is the root cause?
I believe this is the issue:
++++++++++++++++++ LD_LIBRARY CUDA PATHS +++++++++++++++++++
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.9/site-packages/bitsandbytes/__main__.py", line 95, in <module>
generate_bug_report_information()
File "/opt/conda/lib/python3.9/site-packages/bitsandbytes/__main__.py", line 66, in generate_bug_report_information
lib_path = os.environ['LD_LIBRARY_PATH'].strip()
File "/opt/conda/lib/python3.9/os.py", line 679, in __getitem__
raise KeyError(key) from None
KeyError: 'LD_LIBRARY_PATH'
It's looking for LD_LIBRARY_PATH in the environment and can't find it. You usually get output from the tool along the lines of the below.
CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!
CUDA SETUP: CUDA runtime path found: /home/user/miniconda3/envs/bits/lib/libcudart.so
CUDA SETUP: Loading binary /home/user/miniconda3/envs/bits/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
CUDA SETUP: Problem: The main issue seems to be that the main CUDA library was not detected.
CUDA SETUP: Solution 1): Your paths are probably not up-to-date. You can update them via: sudo ldconfig.
CUDA SETUP: Solution 2): If you do not have sudo rights, you can do the following:
CUDA SETUP: Solution 2a): Find the cuda library via: find / -name libcuda.so 2>/dev/null
CUDA SETUP: Solution 2b): Once the library is found add it to the LD_LIBRARY_PATH: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:FOUND_PATH_FROM_2a
CUDA SETUP: Solution 2c): For a permanent solution add the export from 2b into your .bashrc file, located at ~/.bashrc
A solution with Conda that might work is something along these lines:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/conda/lib/
I was fighting this issue the other day on a Windows host, and the above process resolved it for me, followed by uninstalling bitsandbytes and reinstalling via python -m pip install git+https://github.com/TimDettmers/bitsandbytes.git.
The issue thread bitsandbytes#112 was helpful.
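Solution 2a in the quoted bitsandbytes output suggests locating libcuda.so with find. A Python equivalent is sketched below for cases where the shell tooling in the container is limited. The helper name `find_libcuda_dirs` and the idea of passing explicit root directories are assumptions for illustration, not part of bitsandbytes.

```python
import os


def find_libcuda_dirs(roots):
    """Return directories under the given roots that contain a libcuda.so* file."""
    found = []
    for root in roots:
        # os.walk silently yields nothing for roots that do not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            if any(name.startswith("libcuda.so") for name in filenames):
                found.append(dirpath)
    return found
```

A call such as find_libcuda_dirs(["/usr/lib", "/usr/local"]) returns candidate directories; any hit could then be appended to LD_LIBRARY_PATH as in solution 2b. If nothing is found, that is consistent with the driver libraries not being visible inside the container at all.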
I don't think it's related. The container works fine in our production environments with and without quantization. If the issue was with LD_LIBRARY_PATH, it simply would not work.
Also, in conda envs, bitsandbytes uses the CONDA_PREFIX env var, which is correctly set in the docker image.
I started a g5.12xlarge just to make sure.
Here is what I get:
ubuntu:~$ docker run --gpus all -it --entrypoint /bin/bash ghcr.io/huggingface/text-generation-inference:0.6.0
root@d47b97b21497:/usr/src# nvidia-smi
Fri Apr 21 19:37:01 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 27C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 26C P8 8W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 26C P8 9W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 27C P8 10W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@d47b97b21497:/usr/src# python
Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> import bitsandbytes
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /opt/conda/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
>>>
As you can see, torch.cuda.is_available() AND bitsandbytes work.
I'm almost 100% positive this is a driver issue.
Either your driver is too old, or you are missing one of datacenter-gpu-manager or cuda-drivers-fabricmanager.
@OlivierDehaene what AMI did you use?
amazon-eks-gpu-node-1.25-v20230217 (ami-02eaf06c708b0fad1) works for example.