
Comments (4)

Anindyadeep commented on August 16, 2024

Ah, thank you so much @stas00 for the concise answers. I will look more into this content and share resources accordingly, since the ones I had curated are already covered there, and are better written :)


stas00 commented on August 16, 2024

Everything else that applies to training applies to fine-tuning. The only difference is that instead of starting from random weights you start with non-random weights.

Some fine-tuning techniques freeze all or some of the weights, which reduces the number of gradients. That cuts the communication overhead when the grads are reduced, and you need a lot less memory, since you no longer need to allocate optimizer states + grads + master weights for the now-frozen weights.
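In PyTorch terms, freezing comes down to setting `requires_grad = False` on the relevant parameters. A minimal sketch, with a toy two-layer model standing in for a real pretrained network (the layer sizes are arbitrary):

```python
import torch
from torch import nn

# Toy stand-in for a pretrained model; any nn.Module works the same way.
model = nn.Sequential(
    nn.Linear(1024, 1024),  # pretend this is the pretrained body
    nn.Linear(1024, 10),    # pretend this is the task head we want to tune
)

# Freeze everything, then unfreeze only the head.
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True

# Only trainable params get grads, optimizer states and master weights,
# so hand just those to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} params")
```

The frozen parameters still occupy memory for the forward pass, but they no longer contribute grads, optimizer states, or master weights.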

By understanding what type of training/fine-tuning you are doing, as explained here https://github.com/stas00/ml-engineering/blob/master/performance/software.md#anatomy-of-models-memory, you know how much GPU memory you need to place a single model replica; if you can afford it, you can then multiply that across multiple replicas to speed up the training.

So if you want to train a 10B-param model with standard AdamW and mixed-precision bf16, you know you need about 180GB of GPU memory for a single replica; with activations, batch size, and seq_len you'd need more, so 4x 80GB GPUs (320GB) should be a good fit. If you want to train ~2x faster, use 8 GPUs. If you want to train even faster, say 4x, you'd use 2 nodes of 8 GPUs, except since inter-node communication is slower than intra-node, it won't be quite 4x faster, but a bit less than that.
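To make that arithmetic explicit, here is a back-of-the-envelope sketch. The 18 bytes/param follows the AdamW mixed-precision breakdown from the anatomy-of-models-memory link above (2 bytes bf16 weights + 4 bytes fp32 master weights + 4 bytes fp32 grads + 8 bytes optimizer states); real usage will be higher once activations are added:

```python
BYTES_PER_PARAM = 2 + 4 + 4 + 8  # = 18 for AdamW + mixed-precision bf16

def min_training_memory_gb(n_params: float) -> float:
    # Weights + master weights + grads + optimizer states only;
    # activations (batch-size and seq_len dependent) come on top.
    return n_params * BYTES_PER_PARAM / 1e9

params = 10e9  # the 10B-param example above
need = min_training_memory_gb(params)
print(f"{need:.0f} GB per replica")             # -> 180 GB
print(f"~{need / 80:.1f} x 80GB GPUs minimum")  # -> 2.2, so 4 GPUs leave headroom for activations
```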

You can also speed up the training by choosing a faster GPU. If the A100 is your baseline, then everything else being equal, with the H100 you should be able to train 2-3x faster. If you switch to fp8, you'd get another 2x speed multiplier.
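A quick worked example of how those multipliers compound (the baseline hours are hypothetical; the 2-3x and 2x factors are the ones quoted above):

```python
baseline_a100_hours = 100.0  # hypothetical A100 bf16 training time

h100_speedup = 2.5  # midpoint of the quoted 2-3x H100-vs-A100 range
fp8_speedup = 2.0   # quoted multiplier for switching bf16 -> fp8

h100_fp8_hours = baseline_a100_hours / (h100_speedup * fp8_speedup)
print(f"{h100_fp8_hours:.0f}h vs {baseline_a100_hours:.0f}h baseline")  # -> 20h
```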

LORA is a different calculation: your pretrained model is frozen, so those parts consume only 2 bytes per param in half precision. For a 10B-param model you'd need only 20GB of memory, and the LORA part is much smaller, so here you'd easily fit onto a single 80GB GPU, and then you can speed up by adding more GPUs and/or using faster GPUs.
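A rough sketch of that LORA math (the adapter size below is a made-up illustrative number; real adapter sizes depend on the rank and which modules you target, but are typically well under 1% of the base model):

```python
base_params = 10e9      # frozen pretrained model
adapter_params = 50e6   # hypothetical LORA adapter size

frozen_gb = base_params * 2 / 1e9        # half-precision weights only
adapter_gb = adapter_params * 18 / 1e9   # weights + grads + optim states + master weights
print(f"frozen base:  {frozen_gb:.0f} GB")   # -> 20 GB
print(f"LORA adapter: {adapter_gb:.1f} GB")  # -> 0.9 GB
# ~21 GB before activations -- comfortably within a single 80GB GPU
```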

If you want to share your findings, by all means don't hesitate to do so, @Anindyadeep.


KamakshiOjha commented on August 16, 2024

Research diverse GPU models, such as the NVIDIA GeForce RTX 3080 and Tesla V100. For instance, the RTX 3080 is a cost-effective consumer card suitable for smaller tasks, while the Tesla V100 offers more VRAM and excels in compute-intensive workloads.

Analyze your fine-tuning task: identify the model's memory requirements and computational intensity, which will influence GPU selection.

Experiment with configurations, adjusting batch sizes and learning rates. If the RTX 3080's smaller VRAM forces batch sizes too small to keep it fully utilized, you might opt for a GPU with higher VRAM, like the Tesla V100, to fully leverage available resources.

Based on the computational demands of your task, decide on the number of GPUs.
Explore GPU rental costs; for example, if using cloud services, compare prices for GPUs like the RTX 3080 and Tesla V100. Calculate the estimated cost, factoring in training time and potential pricing fluctuations.
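A toy cost comparison along those lines; every price and training-time estimate below is a hypothetical placeholder, so substitute your provider's current rates:

```python
# Hypothetical rental rates and training-time estimates -- not real quotes.
gpus = {
    "RTX 3080":   {"usd_per_hour": 0.30, "est_hours": 40},
    "Tesla V100": {"usd_per_hour": 0.90, "est_hours": 15},
}

for name, g in gpus.items():
    cost = g["usd_per_hour"] * g["est_hours"]
    print(f"{name}: {g['est_hours']}h x ${g['usd_per_hour']}/h = ${cost:.2f}")
# A faster, pricier GPU can still be cheaper overall if it cuts training
# time enough; rerun with real prices before committing.
```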


Anindyadeep commented on August 16, 2024


Those are some awesome suggestions, thank you so much; I will follow them.

