Comments (5)
2x A100 80GB (fluidstack.io):
WORLD_SIZE=2 CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node=2 --nnodes=1 finetune.py --data_path=ShareGPT_unfiltered_cleaned_split.json.generate_human_bot.train_plain.json --num_epochs=1 --base_model=togethercomputer/GPT-NeoXT-Chat-Base-20B --prompt_type=plain --data_mix_in_path=None --micro_batch_size=4 --batch_size=16 --cutoff_len=1024 --run_id=4
Traceback (most recent call last):
  File "/home/fsuser/h2o-llm.clean/finetune.py", line 874, in <module>
    fire.Fire(train)
  File "/home/fsuser/miniconda3/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/fsuser/miniconda3/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/fsuser/miniconda3/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/fsuser/h2o-llm.clean/finetune.py", line 234, in train
    model = model_loader.from_pretrained(
  File "/home/fsuser/miniconda3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/home/fsuser/miniconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2736, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/fsuser/miniconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3064, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/fsuser/miniconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 700, in _load_state_dict_into_meta_model
    set_module_8bit_tensor_to_device(model, param_name, param_device, value=param)
  File "/home/fsuser/miniconda3/lib/python3.10/site-packages/transformers/utils/bitsandbytes.py", line 76, in set_module_8bit_tensor_to_device
    new_value = value.to(device)
  File "/home/fsuser/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
Need to debug the conda env that comes pre-installed on their machines.
from h2ogpt.
1 GPU A100 80GB
CUDA_VISIBLE_DEVICES=0 python finetune.py --data_path=ShareGPT_unfiltered_cleaned_split.json.generate_human_bot.train_plain.json --num_epochs=1 --base_model=togethercomputer/GPT-NeoXT-Chat-Base-20B --prompt_type=plain --data_mix_in_path=None --micro_batch_size=4 --batch_size=16 --cutoff_len=1024 --run_id=4
0%| | 2/5254 [01:45<77:04:12, 52.83s/it]
The above failure was with CUDA 11.8:
>>> import torch
>>> torch.cuda.is_available()
/home/fsuser/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
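A slightly fuller version of the check above can be sketched as follows. It is only a diagnostic aid, not part of h2ogpt: the function name `cuda_report` and the dictionary keys are made up for illustration, and it only uses `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.device_count()`. Comparing the CUDA version PyTorch was built for with the driver's CUDA version reported by `nvidia-smi` may help narrow down a driver/toolkit mismatch.

```python
import os


def cuda_report():
    """Collect basic facts about the CUDA setup that PyTorch sees.

    Hypothetical helper for illustration; the key names are arbitrary.
    """
    report = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment at all.
        report["torch_installed"] = False
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version PyTorch was compiled against (None for CPU-only builds).
    report["torch_built_for_cuda"] = torch.version.cuda
    # False when driver initialization fails, as in the session above.
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = (
        torch.cuda.device_count() if report["cuda_available"] else 0
    )
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Running it once with `CUDA_VISIBLE_DEVICES=0` and once with `CUDA_VISIBLE_DEVICES=1` would show whether only one of the two GPUs triggers the failure.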
Installing CUDA 12.1
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda
sudo apt-get install libcudnn8 libcudnn8-dev libcudnn8-samples
pip uninstall bitsandbytes
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=121 make cuda12x
CUDA_VERSION=121 python setup.py install
cd ..
Now I get this:
/home/fsuser/miniconda3/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
So rebooting.
Tue Apr 4 22:47:01 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:05:00.0 Off | 0 |
| N/A 21C P0 49W / 400W| 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:06:00.0 Off | On |
| N/A 20C P0 50W / 400W| 0MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+================================+===========+=======================|
| No MIG devices found |
+---------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Still the same error now.
8x A100 80GB
Commit 538113d:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 WORLD_SIZE=8 torchrun --nproc_per_node=8 --nnodes=1 finetune.py --data_path=alpaca_data_cleaned.json --run_id=1
1%|▍ | 16/2433 [03:20<8:11:43, 12.21s/it]
2x A6000 Ada 48GB
Commit 538113d:
CUDA_VISIBLE_DEVICES=0,1 WORLD_SIZE=2 torchrun --nproc_per_node=2 --nnodes=1 finetune.py --data_path=alpaca_data_cleaned.json --run_id=1
0%| | 2/2433 [01:05<22:18:07, 33.03s/it]
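The ETAs in the progress bars above follow directly from the step counts and the seconds-per-iteration figure. A minimal sketch of that arithmetic, assuming the progress lines keep the `done/total [elapsed<remaining, N s/it]` shape shown in this thread (the regex and the helper name `remaining_hours` are illustrative, not part of h2ogpt or tqdm):

```python
import re

# Matches e.g. "2/5254 [01:45<77:04:12, 52.83s/it]"
PROGRESS_RE = re.compile(r"(\d+)/(\d+) \[[^<]+<[^,]+,\s*([\d.]+)s/it\]")


def remaining_hours(line):
    """Estimate remaining hours as (total - done) * seconds_per_iteration."""
    match = PROGRESS_RE.search(line)
    if match is None:
        raise ValueError(f"not a recognised progress line: {line!r}")
    done, total = int(match.group(1)), int(match.group(2))
    seconds_per_it = float(match.group(3))
    return (total - done) * seconds_per_it / 3600.0


# Progress lines copied from the runs reported in this thread.
runs = {
    "1x A100 80GB (ShareGPT, 20B)": "0%| | 2/5254 [01:45<77:04:12, 52.83s/it]",
    "8x A100 80GB (alpaca)": "1%| | 16/2433 [03:20<8:11:43, 12.21s/it]",
    "2x A6000 Ada 48GB (alpaca)": "0%| | 2/2433 [01:05<22:18:07, 33.03s/it]",
}

if __name__ == "__main__":
    for name, line in runs.items():
        print(f"{name}: about {remaining_hours(line):.1f} h remaining")
```

Only the two `alpaca_data_cleaned.json` runs are directly comparable, since the 1x A100 run uses a different dataset, model and cutoff length. Between those two, the 8x A100 machine is roughly 2.7x faster per step (33.03 s/it versus 12.21 s/it).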