Comments (5)
Hm, I am not sure why it's slowing down so much in multi-GPU settings. It's speculation, but maybe if the GPUs have a slow connection, the communication overhead is causing this. Btw, we are also (potentially) adding a --skip_validation flag via #1228 to make the validation optional (but yeah, you probably still want to validate and are more concerned about the slowdown). Sorry, I don't have a good explanation at the moment.
Thanks for reporting, and huh, that's a weird one -- I haven't seen this before. As a sanity check, I wonder what happens if you use the generate command to emulate the inference step during finetuning:
litgpt generate base \
--prompt "Recommend a movie for me to watch during the weekend and explain the reason." \
--checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1
@rasbt - That is completely fine, FYI:
(base) ubuntu@ip-10-0-0-185:~/sky_workdir$ litgpt generate base --prompt "Recommend a movie for me to watch during the weekend and explain the reason." --checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1
Loading model 'checkpoints/mistralai/Mistral-7B-Instruct-v0.1/lit_model.pth' with {'name': 'Mistral-7B-Instruct-v0.1', 'hf_config': {'name': 'Mistral-7B-Instruct-v0.1', 'org': 'mistralai'}, 'scale_embeddings': False, 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 512, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'head_size': 128, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 8, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 14336, 'rope_condense_ratio': 1, 'rope_base': 10000, 'n_expert': 0, 'n_expert_per_token': 0, 'rope_n_elem': 128}
Time to instantiate model: 0.11 seconds.
Time to load the model weights: 6.73 seconds.
Seed set to 1234
<s>[INST] Recommend a movie for me to watch during the weekend and explain the reason. [/INST] One great movie that I recommend you to watch during the weekend is "The Shawshank Redemption" released in 1994, directed by Frank Darabont and starring Tim Robbins and Morgan Freeman.
Time for inference 1: 2.38 sec total, 21.01 tokens/sec
Memory used: 14.54 GB
(task, pid=9078) distributed_backend=nccl
(task, pid=9078) All distributed processes registered. Starting with 4 processes
(task, pid=9078) ----------------------------------------------------------------------------------------------------
(task, pid=9078)
(task, pid=9078) [rank: 1] Seed set to 1337
(task, pid=9078) [rank: 3] Seed set to 1337
(task, pid=9078) [rank: 2] Seed set to 1337
(task, pid=9078) [rank: 0] Seed set to 1337
(task, pid=9078) Number of trainable parameters: 76,652,544
(task, pid=9078) Number of non-trainable parameters: 7,241,732,096
(task, pid=9078) The longest sequence length in the train data is 535, the model's maximum sequence length is 535 and context length is 4096
(task, pid=9078) Validating ...
(task, pid=9078) Recommend a movie for me to watch during the weekend and explain the reason.
(task, pid=9078) Below is an instruction that describes a task. Write a response that appropriately completes the request.
(task, pid=9078)
(task, pid=9078) ### Instruction:
(task, pid=9078) Recommend a movie for me to watch during the weekend and explain the reason.
(task, pid=9078)
(task, pid=9078) ### Response:
(task, pid=9078) I recommend the movie "The Shawshank Redemption". It's a classic that is widely loved by many people. It's an excellent story with great characters, and it's sure to keep you engaged from start to finish. Additionally, it's a timeless film that has themes that are still relevant today. It's perfect to watch during the weekend because it's a long movie that will give you plenty of time to get absorbed in the story.
A few observations related to this:
- The validation call at the start of the fit method takes a lot longer compared to a single GPU.
- Setting eval.interval to a low number like 10 only fixes the validation slowdown momentarily (example command below).
It feels like the issue relates to something cumulative. Maybe memory or bandwidth? @rasbt
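For reference, eval.interval can be set from the command line roughly like this (checkpoint path taken from the logs above; --eval.interval follows litgpt's nested CLI options, so double-check the flag against your version):
litgpt finetune lora \
--checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1 \
--eval.interval 10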
I ran into a similar issue when doing LoRA fine-tuning of Llama-2-70B on an 8x A100 80G node with NVLink enabled.
After debugging, it seems some GPUs hang in the middle of generate (litgpt/litgpt/finetune/lora.py, lines 354 to 356 at c67de02), and more specifically in the forward pass inside the next_token function (litgpt/litgpt/generate/base.py, line 42 at c67de02).
IIUC, for each forward pass, the GPU needs to gather all shards from all GPUs to recover the full parameters, according to the FSDP tutorial:
In the forward path:
- Run all_gather to collect all shards from all ranks to recover the full parameter in this FSDP unit
- Run forward computation
- Discard parameter shards it has just collected
and in generate (litgpt/litgpt/generate/base.py, line 84 at c67de02) there is a for-loop that runs the forward pass token by token, which may incur a lot of communication.
This observation is consistent with #607, where token-by-token decoding under FSDP is extremely slow.
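To make the cost concrete, here is a minimal toy sketch (not litgpt code; the script name and the torchrun launch are assumptions for illustration) of a token-by-token decode loop under FSDP, where every iteration pays the all_gather/compute/discard cycle quoted above:
# assumes launch via: torchrun --nproc_per_node=<num_gpus> fsdp_decode.py
# on a node with NCCL-capable GPUs; the model and loop are toy stand-ins
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # stand-in for one transformer block; its weights get sharded across ranks
    model = FSDP(torch.nn.Linear(4096, 4096).cuda())
    x = torch.randn(1, 4096, device="cuda")
    with torch.no_grad():
        for _ in range(64):   # one forward per "generated token"
            x = model(x)      # all_gather full params -> compute -> discard shards
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
With N FSDP units and T generated tokens, that is on the order of N*T all_gathers, which is where the communication volume comes from.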
I don't have a good understanding of why this leads to hanging in the lora script, though -- I guess it may get stuck in a bad state somehow due to over-frequent communication.
In practice, I feel it may not be necessary to run this generate step during validation, so we can leverage the print_out flag from #1228 (comment).
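As a rough sketch of that idea (illustrative names only -- this is not litgpt's actual validate implementation, and the generate_sample flag just mirrors the print_out option from #1228):
import torch

@torch.no_grad()
def validate(model, val_batches, generate_sample: bool = True):
    # the cheap part: plain validation loss over the batches
    model.eval()
    total = sum(torch.nn.functional.mse_loss(model(x), y) for x, y in val_batches)
    if generate_sample:
        # the expensive part under FSDP: a per-token decode loop; with the
        # flag off, multi-GPU validation avoids the repeated all_gathers
        x = torch.zeros(1, 8)
        for _ in range(5):  # toy stand-in for token-by-token generation
            x = model(x)
    model.train()
    return total / len(val_batches)

if __name__ == "__main__":
    net = torch.nn.Linear(8, 8)
    data = [(torch.randn(2, 8), torch.randn(2, 8)) for _ in range(3)]
    print(validate(net, data, generate_sample=False))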
And it would be useful to add a note about the print_out flag in the tutorials https://github.com/Lightning-AI/litgpt/blob/main/tutorials/finetune_lora.md and https://github.com/Lightning-AI/litgpt/blob/main/tutorials/finetune_full.md.