Comments (4)
I attempted to train the reward model without DeepSpeed by executing python3.9 train_reward_model_gptj.py, but this throws one of the following errors. How can I rectify this?
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
or
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
Can you please try deepspeed train_reward_model_gptj.py instead? The README already describes how to train the reward model.
from trlx.
Hi @PhungVanDuy,
While running deepspeed train_reward_model_gptj.py in a multi-GPU setup, some of the initial steps, such as
model = GPTRewardModel("CarperAI/openai_summarize_tldr_sft")
train_pairs = create_comparison_dataset(data_path, "train")
train_dataset = PairwiseDataset(train_pairs, tokenizer, max_length=max_length)
are executed once per process. How do we run the initial loading and preprocessing steps only once and then share the result with all processes?
Sorry for the late response!
Normally in this case, you process the dataset in one process and then share it with the others: determine the rank of each process, do the processing at rank 0, and broadcast the result to the other ranks. Make sure the data you broadcast is picklable (list, tensor, dict, ...).
For PairwiseDataset I don't think that is straightforward, so instead I would suggest processing the dataset offline, saving it to a binary file, and loading it when you train.
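A minimal sketch of the offline approach suggested above: preprocess once, serialize the pairs to a binary file, and have every rank load that file instead of redoing the work. Here create_pairs is a hypothetical stand-in for the script's create_comparison_dataset, and the PairwiseDataset can then be built from the loaded pairs in each process:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the expensive preprocessing step
# (create_comparison_dataset in the real training script).
def create_pairs():
    return [("post", "chosen summary", "rejected summary")] * 3

pairs_path = os.path.join(tempfile.gettempdir(), "train_pairs.pkl")

# Step 1 (offline, run once before launching deepspeed):
# build the pairs and serialize them to a binary file.
with open(pairs_path, "wb") as f:
    pickle.dump(create_pairs(), f)

# Step 2 (inside the training script): every rank simply loads
# the prebuilt file instead of repeating the preprocessing.
with open(pairs_path, "rb") as f:
    train_pairs = pickle.load(f)

print(len(train_pairs))  # 3
```

The same idea works with torch.save/torch.load if the preprocessed data contains tensors; the key point is that the expensive step runs exactly once, outside the multi-process launch.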
I think @PhungVanDuy's response was exhaustive on this