RuntimeError: probability tensor contains either `inf`, `nan` or element < 0 (chatglm-maths, 5 comments, OPEN)

yongzhuo commented on June 14, 2024

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

from chatglm-maths.

Comments (5)

cppww commented on June 14, 2024

Loading the model this way runs out of GPU memory:
model_chatglm = ChatGLMForConditionalGeneration.from_pretrained(pretrained_model_name_or_path)
model_chatglm = model_chatglm.half()
Loading it this way raises the error above:
model_chatglm = ChatGLMForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path,
    load_in_8bit=True,
    device_map="auto",
)
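
For context, this message comes from `torch.multinomial`, which refuses to sample from a distribution that contains `inf`, `nan`, or negative entries. The sketch below reproduces the same kind of check in plain Python and shows how a single overflowed logit turns a whole softmax output into `nan`. The helper names `softmax` and `is_valid_probability_vector` are hypothetical, and this is an illustration of the failure mode, not PyTorch's implementation.

```python
import math


def softmax(logits):
    """Numerically shifted softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # inf - inf yields nan here
    total = sum(exps)
    return [e / total for e in exps]


def is_valid_probability_vector(probs):
    """True only if every entry is finite and non-negative."""
    return all(math.isfinite(p) and p >= 0.0 for p in probs)


healthy = softmax([1.0, 2.0, 3.0])
# One logit overflowed to +inf, e.g. after unstable low-precision arithmetic.
broken = softmax([1.0, math.inf, 2.0])

print(is_valid_probability_vector(healthy))  # True
print(is_valid_probability_vector(broken))   # False
```

Running a check like this on the logits just before sampling can help narrow down which generation step first produces invalid values.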

yongzhuo commented on June 14, 2024

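INT8 training is not very stable; FP16 is still the recommended choice.
LayerNorm (LN) is very sensitive and needs FP16 or FP32 to be reasonably stable.
If you do use INT8 as in the title, follow t10_lora_trl_train_ppo.py and add: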

model = prepare_model_for_int8_training(model,
        use_gradient_checkpointing=True,
        output_embedding_layer_name="lm_head",
        #layer_norm_names=[],
        layer_norm_names=["post_attention_layernorm",
                          "input_layernorm",
                          "ln_f"
                          ],
        )
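
To illustrate why normalization layers are sensitive to numeric precision: fp16 cannot represent values above 65504, so squaring an activation of a few hundred already overflows. The sketch below emulates fp16 rounding with the half-precision format of Python's `struct` module. `to_fp16` and `mean_square` are hypothetical helpers and the activation values are made up; real GPU kernels may accumulate in higher precision, so this only shows the range limit and is not a model of PyTorch's LayerNorm.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE binary16 value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def mean_square(values, cast=float):
    """Mean of squares, applying `cast` after every arithmetic step."""
    total = cast(0.0)
    for v in values:
        total = cast(total + cast(cast(v) * cast(v)))
    return cast(total / len(values))


activations = [300.0, -280.0, 310.0, 295.0]  # made-up example values

full = mean_square(activations)                # 87881.25
half = mean_square(activations, cast=to_fp16)  # inf: 300 * 300 > 65504

print(full, half)
```

Once the statistic is `inf`, dividing by its square root gives zeros or `nan`, which is one way invalid values can reach the sampling step.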

cppww commented on June 14, 2024

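INT8 training is not very stable; FP16 is still the recommended choice. LayerNorm (LN) is very sensitive and needs FP16 or FP32 to be reasonably stable. If you do use INT8 as in the title, follow t10_lora_trl_train_ppo.py and add: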

model = prepare_model_for_int8_training(model,
        use_gradient_checkpointing=True,
        output_embedding_layer_name="lm_head",
        #layer_norm_names=[],
        layer_norm_names=["post_attention_layernorm",
                          "input_layernorm",
                          "ln_f"
                          ],
        )

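I am already using the updated code. But if I use .half() instead of load_in_8bit, a single 3090 with 24 GB runs out of GPU memory. ┭┮﹏┭┮ What is the minimum GPU memory this needs?
I have one more question:
model_ref = create_reference_model(model)
What is the structure of the resulting model_ref? Can I call model_ref.generate() on it directly?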
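
A back-of-the-envelope estimate helps show why a 24 GB card is tight here. The sketch assumes roughly 6.2 billion parameters for ChatGLM-6B (a figure not stated in this thread) and assumes the policy and the reference model are held as two full copies. It counts weights only and ignores activations, gradients, and optimizer state. `weights_gib` is a hypothetical helper.

```python
GIB = 1024 ** 3

# Assumption: approximate parameter count of ChatGLM-6B.
NUM_PARAMS = 6.2e9


def weights_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory for the raw weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


fp16_one_copy = weights_gib(NUM_PARAMS, 2)   # about 11.5 GiB
fp16_two_copies = 2 * fp16_one_copy          # policy + reference, about 23.1 GiB
int8_one_copy = weights_gib(NUM_PARAMS, 1)   # about 5.8 GiB

print(f"fp16, one copy:   {fp16_one_copy:.1f} GiB")
print(f"fp16, two copies: {fp16_two_copies:.1f} GiB")
print(f"int8, one copy:   {int8_one_copy:.1f} GiB")
```

Under these assumptions, two fp16 copies alone nearly fill 24 GB before any training state is allocated.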

yongzhuo commented on June 14, 2024

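Hmm, half precision needs about 30 GB here. model_ref is the baseline model: its weights receive no gradient updates, and it keeps the newly trained model's outputs from drifting too far from the original answers.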
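
One common way PPO-style RLHF uses a frozen reference model is to subtract a per-token KL penalty from the reward, so the policy is discouraged from moving away from the reference. The sketch below shows that idea in plain Python. It is an assumption about the general technique, not a transcript of what the scripts in this repository do; `kl_shaped_rewards` and `beta` are hypothetical names.

```python
def kl_shaped_rewards(logprobs, ref_logprobs, score, beta=0.2):
    """Per-token rewards: -beta * (log p_policy - log p_ref), with the
    scalar task score added on the final token."""
    if not logprobs or len(logprobs) != len(ref_logprobs):
        raise ValueError("need two non-empty sequences of equal length")
    kl = [lp - rlp for lp, rlp in zip(logprobs, ref_logprobs)]
    rewards = [-beta * k for k in kl]
    rewards[-1] += score
    return rewards


# Token 1: policy matches the reference, so there is no penalty.
# Token 2: policy is more confident than the reference, so it is penalized.
rewards = kl_shaped_rewards(
    logprobs=[-1.0, -0.5],
    ref_logprobs=[-1.0, -1.5],
    score=1.0,
    beta=0.5,
)
print(rewards)
```

Because the reference log-probabilities only enter as constants, the reference model needs forward passes but no gradients.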

cppww commented on June 14, 2024

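OK, thank you very much!
May I ask two more questions?

After running t10_lora_trl_train_ppo.py, how large should the saved bin file be? The one I got is only 17.5 KB.
With t10_toy_trl_train_ppo.py and load_in_8bit, the saved weights are only 6875.5 MB. Is there a way to save a bin with the same parameter count as the original ChatGLM? Or is the only way to match the original model to train with LoRA and then merge the adapter?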
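
As a rough reference point for the first question, the expected size of a LoRA adapter can be estimated from its shape. The sketch assumes LoRA is applied only to ChatGLM-6B's fused `query_key_value` projection (hidden size 4096, output size 3 x 4096, 28 layers) with rank 8. The rank and target module are assumptions, since the thread does not state the actual configuration. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(rank, in_features, out_features, num_layers):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features) * num_layers


# Assumptions, not confirmed by the thread:
HIDDEN = 4096          # ChatGLM-6B hidden size
QKV_OUT = 3 * HIDDEN   # fused query_key_value output size
NUM_LAYERS = 28
RANK = 8               # hypothetical LoRA rank

params = lora_param_count(RANK, HIDDEN, QKV_OUT, NUM_LAYERS)
size_mib_fp32 = params * 4 / 1024 ** 2
size_mib_fp16 = params * 2 / 1024 ** 2

print(params)          # 3670016
print(size_mib_fp32)   # 14.0
print(size_mib_fp16)   # 7.0
```

Under these assumptions the adapter would be on the order of megabytes, so a file of a few kilobytes may indicate that the adapter tensors were not written, although the thread does not confirm the cause.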
