
Comments (4)

Anindyadeep commented on August 16, 2024

Ah, thank you so much @stas00 for the concise answers. I will look more into this content and share resources accordingly, since the ones I had curated are already covered there, and are better written :)


stas00 commented on August 16, 2024

Everything else that applies to training applies to fine-tuning. The only difference is that instead of starting from random weights you start with non-random weights.

Some fine-tuning techniques freeze all or some of the weights, which reduces the number of gradients. That cuts the communication overhead when the grads are reduced, and you need a lot less memory, since you no longer need to allocate optimizer states + grads + master weights for the now-frozen weights.
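In PyTorch terms, freezing comes down to setting `requires_grad = False` on the relevant parameters. A minimal sketch, with a toy two-layer model standing in for a real pretrained network (the layer sizes are arbitrary):

```python
import torch
from torch import nn

# Toy stand-in for a pretrained model; any nn.Module works the same way.
model = nn.Sequential(
    nn.Linear(1024, 1024),  # pretend this is the pretrained body
    nn.Linear(1024, 10),    # pretend this is the task head we want to tune
)

# Freeze everything, then unfreeze only the head.
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True

# Only trainable params get grads, optimizer states and master weights,
# so hand just those to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} params")
```

The frozen parameters still occupy memory for the forward pass, but they no longer contribute grads, optimizer states, or master weights.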

By understanding what type of training/fine-tuning you are doing, as explained here https://github.com/stas00/ml-engineering/blob/master/performance/software.md#anatomy-of-models-memory, you know how much GPU memory you need to place a single model replica; if you can afford it, you can then multiply that across multiple replicas to speed up the training.

So if you want to train a 10B-param model with standard AdamW and mixed-precision bf16, you know you need about 180GB of GPU memory for a single replica; with activations, batch size, and seq_len you'd need more, so 4x 80GB GPUs (320GB) should be a good fit. If you want to train ~2x faster, use 8 GPUs. If you want to train even faster, say 4x, you'd use 2 nodes of 8 GPUs, except since inter-node communication is slower than intra-node, it won't be quite 4x faster, but a bit less than that.
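To make that arithmetic explicit, here is a back-of-the-envelope sketch. The 18 bytes/param follows the AdamW mixed-precision breakdown from the anatomy-of-models-memory link above (2 bytes bf16 weights + 4 bytes fp32 master weights + 4 bytes fp32 grads + 8 bytes optimizer states); real usage will be higher once activations are added:

```python
BYTES_PER_PARAM = 2 + 4 + 4 + 8  # = 18 for AdamW + mixed-precision bf16

def min_training_memory_gb(n_params: float) -> float:
    # Weights + master weights + grads + optimizer states only;
    # activations (batch-size and seq_len dependent) come on top.
    return n_params * BYTES_PER_PARAM / 1e9

params = 10e9  # the 10B-param example above
need = min_training_memory_gb(params)
print(f"{need:.0f} GB per replica")             # -> 180 GB
print(f"~{need / 80:.1f} x 80GB GPUs minimum")  # -> 2.2, so 4 GPUs leave headroom for activations
```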

You can also speed up the training by choosing a faster GPU. If the A100 is your baseline, then everything else being equal, with the H100 you should be able to train 2-3x faster. If you switch to fp8, you'd get another 2x speed multiplier.
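A quick worked example of how those multipliers compound (the baseline hours are hypothetical; the 2-3x and 2x factors are the ones quoted above):

```python
baseline_a100_hours = 100.0  # hypothetical A100 bf16 training time

h100_speedup = 2.5  # midpoint of the quoted 2-3x H100-vs-A100 range
fp8_speedup = 2.0   # quoted multiplier for switching bf16 -> fp8

h100_fp8_hours = baseline_a100_hours / (h100_speedup * fp8_speedup)
print(f"{h100_fp8_hours:.0f}h vs {baseline_a100_hours:.0f}h baseline")  # -> 20h
```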

LORA is a different calculation: your pretrained model is frozen, so those parts consume only 2 bytes per param in half precision. For a 10B-param model you'd need only 20GB of memory, and the LORA part is much smaller, so here you'd easily fit onto a single 80GB GPU, and then you can speed up by adding more GPUs and/or using faster GPUs.
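A rough sketch of that LORA math (the adapter size below is a made-up illustrative number; real adapter sizes depend on the rank and which modules you target, but are typically well under 1% of the base model):

```python
base_params = 10e9      # frozen pretrained model
adapter_params = 50e6   # hypothetical LORA adapter size

frozen_gb = base_params * 2 / 1e9        # half-precision weights only
adapter_gb = adapter_params * 18 / 1e9   # weights + grads + optim states + master weights
print(f"frozen base:  {frozen_gb:.0f} GB")   # -> 20 GB
print(f"LORA adapter: {adapter_gb:.1f} GB")  # -> 0.9 GB
# ~21 GB before activations -- comfortably within a single 80GB GPU
```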

If you want to share your findings, by all means don't hesitate to do so, @Anindyadeep.


KamakshiOjha commented on August 16, 2024

Research diverse GPU models, such as the NVIDIA GeForce RTX 3080 and Tesla V100. For instance, the RTX 3080 is a cost-effective consumer card suitable for smaller tasks, while the Tesla V100 offers more VRAM and excels in compute-intensive workloads.

Analyze your fine-tuning task: identify the model's memory requirements and computational intensity, which will influence GPU selection.

Experiment with configurations, adjusting batch sizes and learning rates. If the RTX 3080's smaller VRAM forces batch sizes too small to keep it fully utilized, you might opt for a GPU with higher VRAM, like the Tesla V100, to fully leverage available resources.

Based on the computational demands of your task, decide on the number of GPUs.
Explore GPU rental costs; for example, if using cloud services, compare prices for GPUs like the RTX 3080 and Tesla V100. Calculate the estimated cost, factoring in training time and potential pricing fluctuations.
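A toy cost comparison along those lines; every price and training-time estimate below is a hypothetical placeholder, so substitute your provider's current rates:

```python
# Hypothetical rental rates and training-time estimates -- not real quotes.
gpus = {
    "RTX 3080":   {"usd_per_hour": 0.30, "est_hours": 40},
    "Tesla V100": {"usd_per_hour": 0.90, "est_hours": 15},
}

for name, g in gpus.items():
    cost = g["usd_per_hour"] * g["est_hours"]
    print(f"{name}: {g['est_hours']}h x ${g['usd_per_hour']}/h = ${cost:.2f}")
# A faster, pricier GPU can still be cheaper overall if it cuts training
# time enough; rerun with real prices before committing.
```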


Anindyadeep commented on August 16, 2024


Those are some awesome suggestions, thank you so much; I will follow them.

