Comments (11)

arnavgarg1 avatar arnavgarg1 commented on June 2, 2024

Hi @K-Mistele! This is a known issue that we recently debugged, and it's actually not specific to Ludwig!

The best way to solve it is to set bnb_4bit_compute_dtype in the quantization section of the Ludwig config to bfloat16 instead of float16. With Mistral in particular, batch sizes > 1 lead to numeric overflow in float16 during training, which results in a NaN loss during the first backprop in the train loop.
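As a rough illustration of the setting described above, here is how the quantization fragment of a Ludwig config might look as a Python dict. Only the bnb_4bit_compute_dtype key and its two values come from this thread. The bits value, the surrounding structure, and the helper name with_compute_dtype are assumptions made for this sketch, not the poster's actual config.

```python
import copy

# Hypothetical fragment of a Ludwig LLM fine-tuning config (assumed values).
# Only the quantization section is shown; a real config also needs a base
# model, input/output features, a trainer section, and so on.
base_config = {
    "quantization": {
        "bits": 4,
        # The compute dtype reported in this thread to produce NaN loss
        # with Mistral at batch sizes > 1.
        "bnb_4bit_compute_dtype": "float16",
    },
}


def with_compute_dtype(config, dtype):
    """Return a deep copy of config with the 4-bit compute dtype replaced."""
    updated = copy.deepcopy(config)
    updated.setdefault("quantization", {})["bnb_4bit_compute_dtype"] = dtype
    return updated


# Switch the compute dtype to bfloat16, leaving the original dict untouched.
fixed_config = with_compute_dtype(base_config, "bfloat16")
print(fixed_config["quantization"])
```

The helper copies the dict rather than mutating it, so the float16 and bfloat16 variants can be compared side by side.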

However, I notice you're training on a V100, and I don't think bfloat16 is supported there since it only works on Ampere architectures and above. Is there any chance you can use a newer Nvidia GPU?
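One way to check the hardware question above is to look at the GPU's CUDA compute capability: Volta cards such as the V100 report 7.0, while Ampere cards report 8.x. The sketch below assumes that "compute capability major version >= 8" is a reasonable proxy for native bfloat16 support. The function names are hypothetical, and the PyTorch lookup is optional and only runs if PyTorch with CUDA is installed.

```python
AMPERE_MAJOR = 8  # CUDA compute capability major version for Ampere


def supports_bfloat16(capability):
    """Heuristic: treat Ampere (8.x) and newer as supporting bfloat16 natively."""
    major, _minor = capability
    return major >= AMPERE_MAJOR


def local_gpu_capability():
    """Return (major, minor) for GPU 0, or None if PyTorch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


print(supports_bfloat16((7, 0)))  # V100 (Volta, 7.0) -> False
print(supports_bfloat16((8, 0)))  # A100 (Ampere, 8.0) -> True
print(supports_bfloat16((8, 6)))  # A10G / A5000 / A6000 (Ampere, 8.6) -> True
```

On a machine with a GPU, passing the result of local_gpu_capability() to supports_bfloat16 (after checking it is not None) applies the same test to the local card.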

from ludwig.

K-Mistele avatar K-Mistele commented on June 2, 2024

The only Nvidia GPU that supports bfloat16 is the A100, which I do not have access to. My V100 is an owned GPU, not a rented/cloud one, so I try to stick with it whenever possible since I'm not paying by the hour.

arnavgarg1 avatar arnavgarg1 commented on June 2, 2024

@K-Mistele that makes sense! Actually, the entire A series uses Ampere, so you could consider an A5000 from AWS, which is pretty cheap. I'd also suggest giving the Predibase free trial a try, since we have A5000s/A6000s etc. (A10Gs) for fine-tuning and $25 in free trial credits!

K-Mistele avatar K-Mistele commented on June 2, 2024

I am planning to; I just want to make sure I can use the tool locally first.

K-Mistele avatar K-Mistele commented on June 2, 2024

Is there no workaround for a V100?

arnavgarg1 avatar arnavgarg1 commented on June 2, 2024

Unfortunately, not to my knowledge with Mistral. Do you want to test Llama-2-7B instead? The issue doesn't show up there with larger batch sizes!

K-Mistele avatar K-Mistele commented on June 2, 2024

yeah I can try it

arnavgarg1 avatar arnavgarg1 commented on June 2, 2024

@K-Mistele let me know how it goes!

K-Mistele avatar K-Mistele commented on June 2, 2024

Do you know if Zephyr has the same problem, @arnavgarg1?

arnavgarg1 avatar arnavgarg1 commented on June 2, 2024

@K-Mistele not to my knowledge!

arnavgarg1 avatar arnavgarg1 commented on June 2, 2024

@K-Mistele Did the fix work?
