Comments (5)
Hm, I am not sure why it's slowing down so much in multi-GPU settings. It's speculation, but maybe if the GPUs have a slow connection, the communication overhead is causing this. Btw, we are also (potentially) adding a --skip_validation flag via #1228 to make the validation optional (but yeah, you probably still want to validate and are more concerned about the slowdown). Sorry, I don't have a good explanation at the moment.
Thanks for reporting, and huh, that's a weird one -- I haven't seen this before. As a sanity check, I wonder what happens if you use the generate command to emulate the inference step during finetuning:
litgpt generate base \
--prompt "Recommend a movie for me to watch during the weekend and explain the reason." \
--checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1
@rasbt - That is completely fine, FYI:
(base) ubuntu@ip-10-0-0-185:~/sky_workdir$ litgpt generate base --prompt "Recommend a movie for me to watch during the weekend and explain the reason." --checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1
Loading model 'checkpoints/mistralai/Mistral-7B-Instruct-v0.1/lit_model.pth' with {'name': 'Mistral-7B-Instruct-v0.1', 'hf_config': {'name': 'Mistral-7B-Instruct-v0.1', 'org': 'mistralai'}, 'scale_embeddings': False, 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 512, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'head_size': 128, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 8, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 14336, 'rope_condense_ratio': 1, 'rope_base': 10000, 'n_expert': 0, 'n_expert_per_token': 0, 'rope_n_elem': 128}
Time to instantiate model: 0.11 seconds.
Time to load the model weights: 6.73 seconds.
Seed set to 1234
<s>[INST] Recommend a movie for me to watch during the weekend and explain the reason. [/INST] One great movie that I recommend you to watch during the weekend is "The Shawshank Redemption" released in 1994, directed by Frank Darabont and starring Tim Robbins and Morgan Freeman.
Time for inference 1: 2.38 sec total, 21.01 tokens/sec
Memory used: 14.54 GB
(task, pid=9078) distributed_backend=nccl
(task, pid=9078) All distributed processes registered. Starting with 4 processes
(task, pid=9078) ----------------------------------------------------------------------------------------------------
(task, pid=9078)
(task, pid=9078) [rank: 1] Seed set to 1337
(task, pid=9078) [rank: 3] Seed set to 1337
(task, pid=9078) [rank: 2] Seed set to 1337
(task, pid=9078) [rank: 0] Seed set to 1337
(task, pid=9078) Number of trainable parameters: 76,652,544
(task, pid=9078) Number of non-trainable parameters: 7,241,732,096
(task, pid=9078) The longest sequence length in the train data is 535, the model's maximum sequence length is 535 and context length is 4096
(task, pid=9078) Validating ...
(task, pid=9078) Recommend a movie for me to watch during the weekend and explain the reason.
(task, pid=9078) Below is an instruction that describes a task. Write a response that appropriately completes the request.
(task, pid=9078)
(task, pid=9078) ### Instruction:
(task, pid=9078) Recommend a movie for me to watch during the weekend and explain the reason.
(task, pid=9078)
(task, pid=9078) ### Response:
(task, pid=9078) I recommend the movie "The Shawshank Redemption". It's a classic that is widely loved by many people. It's an excellent story with great characters, and it's sure to keep you engaged from start to finish. Additionally, it's a timeless film that has themes that are still relevant today. It's perfect to watch during the weekend because it's a long movie that will give you plenty of time to get absorbed in the story.
A few observations related to this:
- The validation call at the start of the fit method takes a lot longer compared to a single GPU.
- Setting eval.interval to a low number like 10 only fixes the validation slowdown momentarily (example command below).
It feels like the issue relates to something cumulative. Maybe memory or bandwidth? @rasbt
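For reference, eval.interval can be set from the command line roughly like this (checkpoint path taken from the logs above; --eval.interval follows litgpt's nested CLI options, so double-check the flag against your version):
litgpt finetune lora \
--checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1 \
--eval.interval 10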
I ran into a similar issue when doing LoRA fine-tuning of Llama-2-70B on an 8x A100 80G node with NVLink enabled.
After debugging, it seems some GPUs hang in the middle of generate (litgpt/litgpt/finetune/lora.py, lines 354 to 356 at c67de02), and more specifically in the forward pass inside the next_token function (litgpt/litgpt/generate/base.py, line 42 at c67de02).
IIUC, for each forward pass, the GPU needs to gather all shards from all GPUs to recover the full parameters, according to the FSDP tutorial:
In the forward path:
- Run all_gather to collect all shards from all ranks to recover the full parameter in this FSDP unit
- Run forward computation
- Discard parameter shards it has just collected
and in generate (litgpt/litgpt/generate/base.py, line 84 at c67de02) there is a for-loop that runs the forward pass token by token, which may incur a lot of communication.
This observation is consistent with #607, where token-by-token decoding under FSDP is extremely slow.
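To make the cost concrete, here is a minimal toy sketch (not litgpt code; the script name and the torchrun launch are assumptions for illustration) of a token-by-token decode loop under FSDP, where every iteration pays the all_gather/compute/discard cycle quoted above:
# assumes launch via: torchrun --nproc_per_node=<num_gpus> fsdp_decode.py
# on a node with NCCL-capable GPUs; the model and loop are toy stand-ins
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # stand-in for one transformer block; its weights get sharded across ranks
    model = FSDP(torch.nn.Linear(4096, 4096).cuda())
    x = torch.randn(1, 4096, device="cuda")
    with torch.no_grad():
        for _ in range(64):   # one forward per "generated token"
            x = model(x)      # all_gather full params -> compute -> discard shards
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
With N FSDP units and T generated tokens, that is on the order of N*T all_gathers, which is where the communication volume comes from.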
I don't have a good understanding of why this leads to hanging in the lora script, though -- I guess it may get stuck in a bad state somehow due to over-frequent communication.
In practice, I feel it may not be necessary to run this generate step during validation, so we can leverage the print_out flag from #1228 (comment).
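As a rough sketch of that idea (illustrative names only -- this is not litgpt's actual validate implementation, and the generate_sample flag just mirrors the print_out option from #1228):
import torch

@torch.no_grad()
def validate(model, val_batches, generate_sample: bool = True):
    # the cheap part: plain validation loss over the batches
    model.eval()
    total = sum(torch.nn.functional.mse_loss(model(x), y) for x, y in val_batches)
    if generate_sample:
        # the expensive part under FSDP: a per-token decode loop; with the
        # flag off, multi-GPU validation avoids the repeated all_gathers
        x = torch.zeros(1, 8)
        for _ in range(5):  # toy stand-in for token-by-token generation
            x = model(x)
    model.train()
    return total / len(val_batches)

if __name__ == "__main__":
    net = torch.nn.Linear(8, 8)
    data = [(torch.randn(2, 8), torch.randn(2, 8)) for _ in range(3)]
    print(validate(net, data, generate_sample=False))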
And it would be useful to add a note about the print_out flag in the tutorials https://github.com/Lightning-AI/litgpt/blob/main/tutorials/finetune_lora.md and https://github.com/Lightning-AI/litgpt/blob/main/tutorials/finetune_full.md.