Comments (3)
Hi @msmmpts,
NaN values are more likely if you use a really high learning rate. I would recommend retrying with a learning rate that's an order of magnitude smaller, like 0.0001.
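For reference, the learning rate lives under the `trainer` section of a Ludwig config; a minimal sketch of that change (the value here is illustrative):

```yaml
trainer:
  type: finetune
  learning_rate: 1.0e-4  # an order of magnitude below the 2.0e-4 used in many LLM fine-tuning examples
```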
Hi @justinxzhao,

I tried with a learning rate of 0.0001. The same issue persists:
```
Training:  18%|█▊        | 719/4000 [22:29<44:32, 1.23it/s]training: completed batch 719 memory used: 2984.25MB
/usr/local/lib/python3.10/dist-packages/torchmetrics/aggregation.py:77: UserWarning: Encounted `nan` values in tensor. Will be removed.
  warnings.warn("Encounted `nan` values in tensor. Will be removed.", UserWarning)
```
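That torchmetrics warning means non-finite values are being dropped before aggregation, i.e. the loss itself is already NaN by the time the metric sees it. A minimal sketch reproducing the warning with the public `MeanMetric` API:

```python
import torch
from torchmetrics.aggregation import MeanMetric

# With the default nan_strategy="warn", aggregation metrics drop non-finite
# updates and emit the warning quoted above.
metric = MeanMetric(nan_strategy="warn")
metric.update(torch.tensor([1.0, float("nan"), 3.0]))  # triggers the warning
print(metric.compute())  # tensor(2.) -- the NaN was removed before averaging
```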
+1 on this. I see this warning, and then get the following error at the end of the first epoch each time:
```
Starting with step 0, epoch: 0
Training:  33%|███▎      | 429/1287 [32:07<1:08:57, 4.82s/it, loss=nan]Found NaN or inf values in parameter 'model.base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight' of module 'LLM'
NaN or inf tensors found in the model. Stopping training.
Could not load best checkpoint state from /mnt/disk/AI/ludwig/ludwig-lora/results/experiment_run/model/training_checkpoints/best.ckpt. Best checkpoint may not exist.
Traceback (most recent call last):
  File "/home/constellate/anaconda3/envs/ludwig/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 197, in main
    CLI()
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 72, in __init__
    getattr(self, args.command)()
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 77, in train
    train.cli(sys.argv[2:])
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/train.py", line 395, in cli
    train_cli(**vars(args))
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/train.py", line 185, in train_cli
    model.train(
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/api.py", line 678, in train
    train_stats = trainer.train(
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/trainers/trainer.py", line 1130, in train
    raise RuntimeError(error_message)
RuntimeError: Training ran into an error. No checkpoint was saved. This is because training was terminated early due to the presence of NaN or Inf values in the model weights before a single valid checkpoint could be saved.
```
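The check that stops training here scans the model's weights for non-finite values. A minimal self-contained sketch of that kind of scan, useful for inspecting a loaded model directly (the function name is illustrative, not Ludwig's internal API):

```python
import torch

def find_nonfinite_params(model: torch.nn.Module) -> list[str]:
    """Return the names of parameters that contain NaN or Inf values."""
    return [
        name
        for name, param in model.named_parameters()
        if not torch.isfinite(param).all()
    ]

# e.g. on a model in the state above, this would list entries like
# 'model.base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight'
```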
Here's my model.yaml file:
```yaml
model_type: llm

backend:
  type: local

base_model: mistralai/Mistral-7B-v0.1

quantization:
  bits: 4

adapter:
  type: lora

prompt:
  template: >-
    You are given a premise and a hypothesis below. If the premise entails the hypothesis, return 0. If the premise contradicts the hypothesis, return 2. Otherwise, if the premise does neither, return 1.

    ### Premise: {premise}

    ### Hypothesis: {hypothesis}

    ### Label:

input_features:
  - name: input
    type: text

output_features:
  - name: label
    type: text
    preprocessing:
      max_sequence_length: 1

trainer:
  type: finetune
  batch_size: auto
  gradient_accumulation_steps: 16
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate: 2.0e-4
  optimizer:
    type: paged_adam
```
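One note on this config: it still uses `learning_rate: 2.0e-4`, which is in the range @justinxzhao flagged as NaN-prone. A more conservative trainer section to try is sketched below; whether the `gradient_clipping` options apply depends on your Ludwig version's trainer schema, so treat these values as assumptions rather than a verified fix:

```yaml
trainer:
  type: finetune
  batch_size: auto
  gradient_accumulation_steps: 16
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate: 1.0e-4    # an order of magnitude lower, per the suggestion above
  gradient_clipping:
    clipglobalnorm: 0.5    # bound gradient norms to limit NaN blow-ups
  optimizer:
    type: paged_adam
```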