Comments (3)
Hi @msmmpts,
NaN values are more likely if you use a really high learning rate. I would recommend retrying with a learning rate that's an order of magnitude smaller, like 0.0001.
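For reference, the learning rate lives under the `trainer` section of a Ludwig config; a minimal sketch of that change (the value here is illustrative):

```yaml
trainer:
  type: finetune
  learning_rate: 1.0e-4  # an order of magnitude below the 2.0e-4 used in many LLM fine-tuning examples
```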
Hi @justinxzhao,

I tried with a learning rate of 0.0001. The same issue persists:
```
Training:  18%|█▊        | 719/4000 [22:29<44:32, 1.23it/s]training: completed batch 719 memory used: 2984.25MB
/usr/local/lib/python3.10/dist-packages/torchmetrics/aggregation.py:77: UserWarning: Encounted `nan` values in tensor. Will be removed.
  warnings.warn("Encounted `nan` values in tensor. Will be removed.", UserWarning)
```
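That torchmetrics warning means non-finite values are being dropped before aggregation, i.e. the loss itself is already NaN by the time the metric sees it. A minimal sketch reproducing the warning with the public `MeanMetric` API:

```python
import torch
from torchmetrics.aggregation import MeanMetric

# With the default nan_strategy="warn", aggregation metrics drop non-finite
# updates and emit the warning quoted above.
metric = MeanMetric(nan_strategy="warn")
metric.update(torch.tensor([1.0, float("nan"), 3.0]))  # triggers the warning
print(metric.compute())  # tensor(2.) -- the NaN was removed before averaging
```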
+1 on this. I see this warning, and then get the following error at the end of the first epoch each time:
```
Starting with step 0, epoch: 0
Training:  33%|███▎      | 429/1287 [32:07<1:08:57, 4.82s/it, loss=nan]Found NaN or inf values in parameter 'model.base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight' of module 'LLM'
NaN or inf tensors found in the model. Stopping training.
Could not load best checkpoint state from /mnt/disk/AI/ludwig/ludwig-lora/results/experiment_run/model/training_checkpoints/best.ckpt. Best checkpoint may not exist.
Traceback (most recent call last):
  File "/home/constellate/anaconda3/envs/ludwig/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 197, in main
    CLI()
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 72, in __init__
    getattr(self, args.command)()
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 77, in train
    train.cli(sys.argv[2:])
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/train.py", line 395, in cli
    train_cli(**vars(args))
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/train.py", line 185, in train_cli
    model.train(
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/api.py", line 678, in train
    train_stats = trainer.train(
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/trainers/trainer.py", line 1130, in train
    raise RuntimeError(error_message)
RuntimeError: Training ran into an error. No checkpoint was saved. This is because training was terminated early due to the presence of NaN or Inf values in the model weights before a single valid checkpoint could be saved.
```
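The check that stops training here scans the model's weights for non-finite values. A minimal self-contained sketch of that kind of scan, useful for inspecting a loaded model directly (the function name is illustrative, not Ludwig's internal API):

```python
import torch

def find_nonfinite_params(model: torch.nn.Module) -> list[str]:
    """Return the names of parameters that contain NaN or Inf values."""
    return [
        name
        for name, param in model.named_parameters()
        if not torch.isfinite(param).all()
    ]

# e.g. on a model in the state above, this would list entries like
# 'model.base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight'
```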
Here's my model.yaml file:
```yaml
model_type: llm

backend:
  type: local

base_model: mistralai/Mistral-7B-v0.1

quantization:
  bits: 4

adapter:
  type: lora

prompt:
  template: >-
    You are given a premise and a hypothesis below. If the premise entails the hypothesis, return 0. If the premise contradicts the hypothesis, return 2. Otherwise, if the premise does neither, return 1.

    ### Premise: {premise}

    ### Hypothesis: {hypothesis}

    ### Label:

input_features:
  - name: input
    type: text

output_features:
  - name: label
    type: text
    preprocessing:
      max_sequence_length: 1

trainer:
  type: finetune
  batch_size: auto
  gradient_accumulation_steps: 16
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate: 2.0e-4
  optimizer:
    type: paged_adam
```
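One note on this config: it still uses `learning_rate: 2.0e-4`, which is in the range @justinxzhao flagged as NaN-prone. A more conservative trainer section to try is sketched below; whether the `gradient_clipping` options apply depends on your Ludwig version's trainer schema, so treat these values as assumptions rather than a verified fix:

```yaml
trainer:
  type: finetune
  batch_size: auto
  gradient_accumulation_steps: 16
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate: 1.0e-4    # an order of magnitude lower, per the suggestion above
  gradient_clipping:
    clipglobalnorm: 0.5    # bound gradient norms to limit NaN blow-ups
  optimizer:
    type: paged_adam
```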