Comments (7)
Hi @ohmeow, this looks like an issue with the model taking too long to push to the Hub before the 30 min timeout from accelerate kicked in. Do you by any chance know if your upload speed was bottlenecked?
One thing you can do is tweak the timeout when the accelerator is instantiated, e.g.
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs
# Increase distributed timeout to 3h to enable push to Hub to complete
accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])
from alignment-handbook.
Hi folks, I was able to repro the issue and AFAICT it only happens for full training (i.e. with ZeRO-3) and not with QLoRA (DDP).
The solution I've implemented in the linked PR above is to pull the push_to_hub() call outside the main process. Calling it from there seems to be the source of the conflict, because the trainer internals have their own checks for which process the call is being run from. Let me know if that helps once #88 is merged!
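A minimal sketch of that kind of change, under stated assumptions: `finish_training` and `FakeTrainer` are hypothetical names used for illustration, not code from the PR, and the trainer is only assumed to expose `create_model_card()` and `push_to_hub()`. Main-process-only work stays behind the guard, while `push_to_hub()` is called on every rank so that the trainer can apply its own process checks.

```python
class FakeTrainer:
    """Stand-in for a trainer (hypothetical); records which methods were called."""

    def __init__(self):
        self.calls = []

    def create_model_card(self):
        self.calls.append("create_model_card")

    def push_to_hub(self):
        self.calls.append("push_to_hub")


def finish_training(trainer, is_main_process, push_to_hub=True):
    # Work that must happen exactly once stays behind the main-process guard.
    if is_main_process:
        trainer.create_model_card()
    # The push is issued from every rank; the trainer is assumed to decide
    # internally which rank actually uploads.
    if push_to_hub:
        trainer.push_to_hub()


main, worker = FakeTrainer(), FakeTrainer()
finish_training(main, is_main_process=True)
finish_training(worker, is_main_process=False)
print(main.calls)    # ['create_model_card', 'push_to_hub']
print(worker.calls)  # ['push_to_hub']
```

In a real script `is_main_process` would come from the accelerator or trainer state; here it is passed in explicitly so the sketch runs on its own.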
from alignment-handbook.
I'll try that. What's funny is that it looks like all the files get uploaded ... it just gets stuck and eventually times out.
from alignment-handbook.
Same here: everything is pushed to the Hugging Face Hub after fine-tuning, but then the run crashes for no apparent reason. For now I'm removing the integrated push_to_hub and running the push manually to keep the run from crashing (even though the push itself succeeds).
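For anyone taking the same workaround, here is a minimal sketch of a manual upload, assuming the `huggingface_hub` package is installed and you are already logged in. `push_checkpoint`, the output directory, and the repo id are hypothetical placeholders, not names from this thread.

```python
from pathlib import Path


def push_checkpoint(output_dir, repo_id):
    """Upload a finished checkpoint folder to the Hub as a separate step."""
    folder = Path(output_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"No checkpoint directory at {folder}")
    # Imported lazily so the path check above works without the dependency.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id, exist_ok=True)  # no-op if the repo already exists
    api.upload_folder(folder_path=str(folder), repo_id=repo_id)


# Example call, run once training has exited cleanly (placeholder values):
# push_checkpoint("data/my-sft-run", "my-username/my-sft-run")
```

Running the upload in its own process keeps it outside the distributed job, so it is not subject to the collective timeout discussed above.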
from alignment-handbook.
Thanks for checking @alvarobartt - this is very strange and I can't reproduce on my setup 🤔. How many nodes / GPUs are you running on?
from alignment-handbook.
I think the problem is that evaluation is fairly long and runs beyond the 30 min timeout. If so, it should reproduce on a low GPU count.
Moreover, I wasn't able to increase the timeout by passing the parameter to Accelerate as proposed.
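If the script sets up its process group through the transformers Trainer rather than a standalone Accelerator, one thing worth trying is the `ddp_timeout` field of `TrainingArguments` (in seconds, default 1800). Whether that is what applies in this setup is an assumption, not something confirmed in this thread. A small sketch, with `timeout_kwargs` as a hypothetical helper:

```python
from datetime import timedelta


def timeout_kwargs(hours):
    """Return TrainingArguments keyword arguments for a longer distributed timeout."""
    seconds = int(timedelta(hours=hours).total_seconds())
    return {"ddp_timeout": seconds}


kwargs = timeout_kwargs(3)
print(kwargs)  # {'ddp_timeout': 10800}

# Assumed usage (requires transformers; output_dir is a placeholder):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", **kwargs)
```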
from alignment-handbook.
Thanks for checking @alvarobartt - this is very strange and I can't reproduce on my setup 🤔. How many nodes / GPUs are you running on?
I tried out your suggestion to explore this further, because I was seeing the same thing with push_to_hub=True. Your suggestion, for reference:
# Increase distributed timeout to 3h to enable push to Hub to complete
accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])
But it kept failing on 8 x A100 (both 40GB and 80GB) and even on 8 x H100 80GB. I adjusted the timeouts so that the fine-tuned models could be pushed to the Hub, but had no success, even though everything was indeed pushed.
from alignment-handbook.