
textrl's Introduction

TextRL: Text Generation with Reinforcement Learning


TextRL is a Python library that aims to improve text generation using reinforcement learning, building upon Hugging Face's Transformers, PFRL, and OpenAI Gym. TextRL is designed to be easily customizable and can be applied to various text-generation models.



Introduction

TextRL utilizes reinforcement learning to fine-tune text generation models. It is built upon the following libraries: Hugging Face's Transformers, PFRL, and OpenAI Gym.

Example - gpt2


GPT2 Example

import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForCausalLM, AutoTokenizer
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

checkpoint = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")

model = model.cuda()


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):  # predicted_list is the list of predicted token sequences
        reward = [0] * len(predicted_list)  # one reward per sampled sequence (len == compare_sample)
        if finish:
            reward = [1] * len(predicted_list)  # calculate reward scores based on predicted_list
        return reward


observaton_list = [{"input":"explain how attention work in seq2seq model"}]
env = TextRLEnv(model, tokenizer, observation_input=observaton_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0,
                    repetition_penalty=2)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)
print(actor.predict(observation_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom-test',
)

print(actor.predict(observation_list[0]))

Example - flan-t5


Example Code

colab example: google/flan-t5-base

import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')


tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")  
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()
model.cuda()

sentiment = pipeline('sentiment-analysis',
                     model="cardiffnlp/twitter-roberta-base-sentiment",
                     tokenizer="cardiffnlp/twitter-roberta-base-sentiment",
                     device=0,
                     return_all_scores=True)

class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):  # predicted_list is the list of predicted token sequences
      reward = 0
      if finish or len(predicted_list[0]) >= self.env_max_length:
        predicted_text = tokenizer.convert_tokens_to_string(predicted_list[0])
        # sentiment classifier
        reward = sentiment(input_item['input']+predicted_text)[0][0]['score'] * 10
      return reward

observation_list = [{'input': 'i think dogecoin is'}]
env = MyRLEnv(model, tokenizer, observation_input=observation_list, compare_sample=1)
actor = TextRLActor(env, model, tokenizer, optimizer='adamw',
                    temperature=0.8,
                    top_k=100,
                    top_p=0.85)
agent = actor.agent_ppo(update_interval=50, minibatch_size=3, epochs=10, lr=3e-4)
print(actor.predict(observation_list[0]))

pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=3000,
    eval_n_steps=None,
    eval_n_episodes=1,       
    train_max_episode_len=100,  
    eval_interval=10,
    outdir='checkpoint', 
)
agent.load("./checkpoint/best")
print(actor.predict(observation_list[0]))

Example - bigscience/bloomz-7b1-mt


bloomz-7b1-mt Example

import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForCausalLM, AutoTokenizer
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

checkpoint = "bigscience/bloomz-7b1-mt"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")

model = model.cuda()


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):  # predicted_list is the list of predicted token sequences
        reward = [0] * len(predicted_list)  # one reward per sampled sequence (len == compare_sample)
        if finish:
            reward = [1] * len(predicted_list)  # calculate reward scores based on predicted_list
        return reward


observaton_list = [{"input":"explain how attention work in seq2seq model"}]
env = TextRLEnv(model, tokenizer, observation_input=observaton_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)
print(actor.predict(observation_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom-test',
)

print(actor.predict(observation_list[0]))

Example - 176B BLOOM


bloomz-176B Example

We strongly recommend contributing to the public swarm to increase Petals capacity:

https://github.com/bigscience-workshop/petals

Install Petals first: pip install -U petals

import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

MODEL_NAME = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)
model = model.cuda()


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):  # predicted_list is the list of predicted token sequences
        reward = [0] * len(predicted_list)  # one reward per sampled sequence (len == compare_sample)
        if finish:
            reward = [1] * len(predicted_list)  # calculate reward scores based on predicted_list
        return reward


observaton_list = [{"input":"explain how attention work in seq2seq model"}]
env = TextRLEnv(model, tokenizer, observation_input=observaton_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)

print(actor.predict(observation_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom-test',
)

print(actor.predict(observation_list[0]))

Example - Controllable generation via RL to let Elon Musk speak ill of DOGE


[Controllable generation via RL to let Elon Musk speak ill of DOGE](https://github.com/voidful/TextRL/blob/main/example/2022-12-10-textrl-elon-musk.ipynb)

colab example: bigscience/bloom-560m

colab example: huggingtweets/elonmusk

before: i think dogecoin is a great idea.
after: i think dogecoin is a great idea, but I think it is a little overused.
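The notebook linked above defines its own reward; the snippet below is only a rough sketch of the idea, reusing the sentiment pipeline and tokenizer from the flan-t5 example above (the class name is made up for illustration, and treating the classifier's first score as the negative label follows that example's indexing):

class NegativeSentimentEnv(TextRLEnv):  # illustrative name, not taken from the notebook
    def get_reward(self, input_item, predicted_list, finish):
        reward = [0]  # one entry per sample; assumes compare_sample=1
        if finish:
            predicted_text = tokenizer.convert_tokens_to_string(predicted_list[0])
            scores = sentiment(input_item["input"] + predicted_text)[0]
            # scores[0] is assumed to be the negative-sentiment label, as in the flan-t5 example
            reward = [scores[0]["score"] * 10]  # reward completions that speak ill of DOGE
        return reward

Training with a reward like this (via actor.agent_ppo and train_agent_with_evaluation, as in the other examples) is the general recipe that nudges the completion toward the more critical text shown in the before/after above.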

Installation

pip install

pip install pfrl@git+https://github.com/voidful/pfrl.git
pip install textrl

Build from source

git clone and cd into this project.

pip install -e .

Usage

Initialize agent and environment

import torch
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloomz-7b1-mt"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")

model = model.cuda()

Set up reward function for environment

  • predicted_list (list[str]): the list of predicted token sequences
  • finish (bool): whether the end of the sentence has been reached

class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        reward = [0] * len(predicted_list)  # default reward before the sentence finishes
        if finish:
            reward = [1] * len(predicted_list)  # calculate reward scores based on predicted_list
        return reward

Prepare for training

  • observation_list should be a list of all possible inputs for model training (each a dict with an "input" key)

    Example: observation_list = [{"input":'testing sent 1'},{"input":'testing sent 2'}]

env = MyRLEnv(model, tokenizer, observation_input=observation_list)
actor = TextRLActor(env, model, tokenizer)
agent = actor.agent_ppo(update_interval=10, minibatch_size=2000, epochs=20)

Train

n_episodes = 1000
max_episode_len = 200  # max sentence length

for i in range(1, n_episodes + 1):
    obs = env.reset()
    R = 0
    t = 0
    while True:
        action = agent.act(obs)
        obs, reward, done, pred = env.step(action)
        R += reward
        t += 1
        reset = t == max_episode_len
        agent.observe(obs, reward, done, reset)
        if done or reset:
            break
    if i % 10 == 0:
        print('episode:', i, 'R:', R)
    if i % 50 == 0:
        print('statistics:', agent.get_statistics())
print('Finished.')

Another way to train:

import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

train_agent_with_evaluation(
    agent,
    env,
    steps=1000,
    eval_n_steps=None,
    eval_n_episodes=1500,
    train_max_episode_len=50,
    eval_interval=10000,
    outdir='somewhere',
)

Prediction

agent.load("somewhere/best")  # loading the best model
actor.predict("input text")

This usage section covers how to initialize the agent and environment, set up the reward function, prepare for training, train the model, and make predictions, as well as an alternative training loop using train_agent_with_evaluation.

Dump the trained model to a Hugging Face model

textrl-dump --model ./model_path_before_rl --rl ./rl_path --dump ./output_dir
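After dumping, the output directory should load like a regular Hugging Face checkpoint. The snippet below is a minimal sketch under that assumption, where ./output_dir is the directory passed to --dump; for a seq2seq base such as flan-t5, swap in AutoModelForSeq2SeqLM:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./output_dir")
model = AutoModelForCausalLM.from_pretrained("./output_dir")

inputs = tokenizer("explain how attention work in seq2seq model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))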

Key Parameters for RL Training

To finetune a language model using RL, you need to modify the reward function:

from textrl import TextRLEnv

class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        # input_item is the prompt input for the model; it will be one of your observations.
        # An observation can be a list of sentences, e.g. ['input sentence', 'xxx', 'yyy'];
        # only the first entry ('input sentence') is fed to the model, and
        # the remaining entries can serve as references for reward calculation.

        # predicted_list is the list of sentences generated by the RL model;
        # it can be used for ranking-based reward calculation.

        # finish is the end-of-sentence flag. get_reward is called while generating each token, and
        # when finish is True the sentence is complete, so it can be used for sentence-level reward calculation.

        # reward should be a list with the same length as predicted_list.
        reward = [0.0] * len(predicted_list)
        return reward
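For instance, here is an illustrative ranking-style reward (the subclass name and scoring rule are invented for this sketch, not part of TextRL) that scores each candidate in predicted_list by its length once generation finishes:

class LengthRankRLEnv(TextRLEnv):  # illustrative subclass for this sketch
    def get_reward(self, input_item, predicted_list, finish):
        reward = [0.0] * len(predicted_list)
        if finish:
            # Rank the sampled candidates by token count: shortest gets 0, longest gets len-1.
            lengths = [len(tokens) for tokens in predicted_list]
            order = sorted(range(len(lengths)), key=lambda i: lengths[i])
            for rank, idx in enumerate(order):
                reward[idx] = float(rank)
        return reward

With compare_sample > 1, a relative score like this gives PPO a ranking signal across the sampled completions rather than a single absolute number.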

Parameters for sampling diverse examples:

actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,  # if True, always pick the max-probability token at each step
                    temperature=1,                # temperature for sampling
                    compare_sample=2,             # number of samples to rank
                    top_k=0,                      # top-k sampling
                    top_p=1.0)                    # top-p (nucleus) sampling

When training a reinforcement learning (RL) model, several key parameters need to be tuned to ensure optimal performance. Here is a list of important parameters and their descriptions:

  1. Update Interval: This determines how often the RL agent updates its policy based on collected experiences. A smaller update interval means the agent learns more frequently from recent experiences, while a larger interval allows more experiences to accumulate before learning. In the example above, the update interval is set to 10.
update_interval=10
  2. Minibatch Size: The number of experiences sampled from the experience replay buffer to compute the gradient update. A larger minibatch size helps to stabilize learning and reduce variance, but at the cost of increased computational requirements.
minibatch_size=2000
  3. Epochs: The number of times the agent iterates through the entire minibatch to update its policy. More epochs can lead to better learning but may increase the risk of overfitting.
epochs=20
  4. Discount Factor (Gamma): This parameter determines how much future rewards are discounted when calculating the expected return. A value closer to 1 makes the agent more farsighted, while a value closer to 0 makes the agent more focused on immediate rewards.
gamma=0.99
  5. Learning Rate: The step size used for updating the policy. A larger learning rate allows for faster convergence but may lead to instability in learning, while a smaller learning rate ensures stable learning at the cost of slower convergence.
lr=1e-4
  6. Epsilon: A parameter used in the PPO algorithm to clip the policy ratio. This helps to control the magnitude of policy updates, preventing excessively large updates that can destabilize learning.
epsilon=0.2
  7. Entropy Coefficient: This parameter encourages exploration by adding a bonus reward for taking less certain actions. A higher entropy coefficient promotes more exploration, while a lower coefficient focuses the agent on exploiting known strategies.
entropy_coef=0.01
  8. Training Steps: The total number of steps the agent takes during training. More steps typically lead to better learning but may require more computational time.
steps=1000
  9. Evaluation Interval: The number of training steps between evaluations. Increasing the evaluation interval reduces the computational time spent on evaluation, but it may also reduce the frequency at which you can monitor the agent's progress.
eval_interval=10000
  10. Max Episode Length: The maximum number of steps allowed in a single episode during training. This can prevent the agent from getting stuck in long, unproductive episodes.
train_max_episode_len=50

These parameters need to be carefully tuned based on the specific problem and environment to achieve the best performance. It is generally recommended to start with default values and then adjust them based on the observed learning behavior.
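As a consolidated sketch of how the parameters above plug into the API used in this README: update_interval, minibatch_size, epochs, and lr are passed to actor.agent_ppo, and steps, eval_interval, and train_max_episode_len to train_agent_with_evaluation, exactly as in the examples. Where gamma, epsilon, and entropy_coef are set depends on your TextRL/PFRL version, so they are only noted in comments rather than shown as confirmed keyword arguments:

agent = actor.agent_ppo(
    update_interval=10,   # learn from collected experiences every 10 steps
    minibatch_size=2000,  # experiences per gradient update
    epochs=20,            # passes over each minibatch
    lr=1e-4,              # optimizer learning rate
)
# gamma=0.99, epsilon=0.2, entropy_coef=0.01 are standard PPO settings (see the list above);
# consult your PFRL/TextRL version for where to pass them.

train_agent_with_evaluation(
    agent,
    env,
    steps=1000,                # total training steps
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=10000,       # training steps between evaluations
    train_max_episode_len=50,  # cap on episode length
    outdir='somewhere',
)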

textrl's People

Contributors

kongfha, voidful


textrl's Issues

Support for AutoModelForSeq2SeqLM

Hi,

Nice library, thanks for your work :)

As far as I understand the code it natively supports AutoModelForCausalLM (decoder only models), but currently does not handle AutoModelForSeq2SeqLM (Encoder+Decoder models), right?

Conceptually they shouldn't be that different to implement than AutoModelForCausalLM, but it would be great for my use case. Are they on the roadmap, or could you possibly give me some hints on which pitfalls to avoid when trying to patch it in myself? E.g. how to keep the gradients for the encoder, etc.

Thanks!

Update interval

Hi,

Firstly, great repo! Thanks!

Secondly, a quick understanding question: what does update interval mean for a text generation task? Does it mean the LM generates n tokens and then the model updates, when update_interval=n?

Thanks!

Documentation on Methodology

Hi,

Great Repo!

I see that there are some papers listed in the comments. Do you think you could give us a quick guide or a list of papers corresponding to the techniques you have implemented?

Thanks!

Support for other PFRL Algorithms

Is it possible to change the code to use other PFRL algorithms such as REINFORCE? Does this require major changes in multiple files?

About the compare_sample

Thank you very much for providing the code. However, when I set compare_sample=3, I get an error:
# in _compute_explained_variance: return float(1 - np.var(t - y) / vart)
ValueError: operands could not be broadcast together with shapes (2,3) (2,)
How should I handle this?
Since I need multiple compare_sample candidates to better estimate the current state, I would like to set this value higher. I would also like to know what the update_interval and minibatch_size parameters do.
Many thanks.

'Model' object has no attribute 'lm_head'

When I try to initialize the model, I get the error AttributeError: 'XLNetLMHeadModel' object has no attribute 'lm_head'. I tried the xlnet-base and bert-base models from Hugging Face. From the README it seems the code should work with any model, but I failed to get it working. Could you please help with that?

Example Code:

tokenizer = AutoTokenizer.from_pretrained('xlnet-base-cased')  
model = AutoModelWithLMHead.from_pretrained('xlnet-base-cased')

Reward policy agent environment is not training with Finetuned model

Loading my Google Flan-T5 model, fine-tuned for question answering, from my Hugging Face account:

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSeq2SeqLM

peft_model_id = "harshs21/google-flan-t5-base"
config = PeftConfig.from_pretrained(peft_model_id)
model1 = AutoModelWithLMHead.from_pretrained(config.base_model_name_or_path) # load_in_8bit=True,
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

Load the LoRA model:

model1 = PeftModel.from_pretrained(model1, peft_model_id)

But while loading the agent with the following code:
env = MyRLEnv(model1, tokenizer, observation_input=observaton_list) #, output_list = output_list
actor = TextRLActor(env, model1, tokenizer)
agent = actor.agent_ppo(update_interval=5, minibatch_size=256, epochs=20)

It is giving me the following error:

/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: should_run_async will not call transform_cell automatically in the future. Please pass the result to transformed_cell argument and any exception that happen during thetransform in preprocessing_exc_tuple in IPython 7.17 and above.
and should_run_async(code)
Traceback (most recent call last):
  in <cell line: 2>:2
  File "/usr/local/lib/python3.10/dist-packages/textrl/actor.py", line 62, in __init__
     59 |         elif 'encoder' in parents:  # t5
     60 |             transformers_model = model.encoder
     61 |         else:
  -> 62 |             raise ValueError('model not supported')
     63 |
     64 |         if unfreeze_layer_from_past > 0:
     65 |             self.middle_model = HFModelListModule(list(transformers_model.children())
ValueError: model not supported

The same thing happens when I load my fine-tuned eleutherai/pythia-1.3B model from my Hugging Face profile. Could someone please tell me how to train a fine-tuned model with the RLHF policy?

AssertionError

The full error message:

Traceback (most recent call last):
  File "/home/ll_coder/workspace/Aigc/RLHF.py", line 36, in <module>
    pfrl.experiments.train_agent_with_evaluation(
  File "/home/ll_coder/anaconda3/envs/py39/lib/python3.9/site-packages/pfrl/experiments/train_agent.py", line 208, in train_agent_with_evaluation
    eval_stats_history = train_agent(
  File "/home/ll_coder/anaconda3/envs/py39/lib/python3.9/site-packages/pfrl/experiments/train_agent.py", line 57, in train_agent
    action = agent.act(obs)
  File "/home/ll_coder/anaconda3/envs/py39/lib/python3.9/site-packages/pfrl/agent.py", line 161, in act
    return self.batch_act([obs])[0]
  File "/home/ll_coder/anaconda3/envs/py39/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/ll_coder/anaconda3/envs/py39/lib/python3.9/site-packages/textrl/actor.py", line 163, in batch_act
    return self._batch_act_train(batch_obs)
  File "/home/ll_coder/anaconda3/envs/py39/lib/python3.9/site-packages/pfrl/agents/ppo.py", line 721, in _batch_act_train
    assert len(self.batch_last_action) == num_envs
AssertionError

I'm sure the environment was installed according to the README, but I always get this error when running example 1. Has anyone run into a similar problem?

unfreeze_layer_from_past parameter

Nice repo!!!

It seems that the default parameters for the policy freeze all the layers of the language model and only update the lm_head.
I tried the provided example of flan-T5 here: https://colab.research.google.com/drive/1DYHt0mi6cyl8ZTMJEkMNpsSZCCvR4jM1?usp=sharing

When I changed the value of unfreeze_layer_from_past to 1 to update the weights of the final layer of flan-t5, the behavior changed and the actor started to generate empty text (see screenshots). After training it also gave me empty text.

What is the reason for this behavior?

NOTE: I did not change anything else in the flan-t5 code example.

Text generation after period/full-stop (".")

When there is a period after a sentence in the observation_list, the predicted text is just token. Even in the colab example with elon-tweets, the same thing happens. Try something like:
observation_list = [['i think dogecoin is good.']]
Do you have any idea what might be causing this? I even tried my locally trained models.

Edit: This happened with huggingtweets/elonmusk and my model which I fine-tuned on a dataset. This might be happening due to the data it has been trained on. Any ideas on how to solve this problem?

Are there any examples for T5 or BART? Why do T5 and BART give the same output before/after training?

Hello,

I used this package to fine-tune a sequence-to-sequence LM, but the predictions after PPO training are always the same as the predictions before training.

What I tried was to change the colab sample code elon_musk_gpt.ipynb: change the model name and switch from AutoModelWithLMHead to AutoModelForSeq2SeqLM. (screenshot)

When I print out the decoded sentences during training, I see that the predicted sentences change at each iteration, but the predictions after PPO training are always the same as before training. Is there anything I need to pay attention to? Or is this package not applicable to sequence-to-sequence LMs?

Prediction before training: (screenshot)

Prediction during iteration: (screenshot)

Prediction after training (code used):

pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=300,
    eval_n_steps=None,
    eval_n_episodes=1,       
    train_max_episode_len=100,  
    eval_interval=10,
    outdir='elon_musk_dogecoin', 
)

agent.load("./elon_musk_dogecoin/best")
actor.predict(observaton_list[0]) #<------- prediction after training

(screenshot)

AttributeError: 'MyRLEnv' object has no attribute 'num_envs'

Issue

I got the error AttributeError: 'MyRLEnv' object has no attribute 'num_envs'. What num_envs should be in this case? A function that returns 1?

Environment

python: Python 3.10.6
textRL: textrl==0.1.9
OS: Ubuntu 22.04.1 LTS

Executed code

import pfrl
from textrl import TextRLEnv, TextRLActor
from transformers import AutoModelForCausalLM, AutoTokenizer

# checkpoint = "bigscience/bloomz-7b1-mt"
checkpoint = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype="auto", device_map="auto")

model = model.cuda()


class MyRLEnv(TextRLEnv):
    # predicted will be the list of predicted token
    def get_reward(self, input_item, predicted_list, finish):
        reward = 0
        if finish:
            reward = len(predicted_list)
        return reward


observaton_list = [["explain how attention work in seq2seq model"]]
env = MyRLEnv(model, tokenizer, observation_input=observaton_list)
actor = TextRLActor(env, model, tokenizer, act_deterministically=True)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)

print(actor.predict(observaton_list[0]))

pfrl.experiments.train_agent_batch_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom—test',
)

print(actor.predict(observaton_list[0]))

Traceback

Traceback (most recent call last):
  File "{...}/main.py", line 35, in <module>
    pfrl.experiments.train_agent_batch_with_evaluation(
  File "{...}/lib/python3.10/site-packages/pfrl/experiments/train_agent_batch.py", line 247, in train_agent_batch_with_evaluation
    eval_stats_history = train_agent_batch(
  File "{...}lib/python3.10/site-packages/pfrl/experiments/train_agent_batch.py", line 51, in train_agent_batch
    num_envs = env.num_envs
AttributeError: 'MyRLEnv' object has no attribute 'num_envs'

Problems in the inference process

Nice repo!!

I completed the training using the code examples and am now making predictions on the test set. However, I found that using actor.predict to obtain the model's outputs on the test set is unusually slow. I tried using the dump method you provided to convert the saved model into a Hugging Face model and then run inference; this is very fast, but the results are much worse than with actor.predict.
Why is there such a difference, and how should I proceed?

ValueError: Expected parameter logits

Hi, I tried to run the GPT2 example and I got this error:

ValueError: Expected parameter logits (Tensor of shape (2, 2, 50257)) of distribution Categorical(logits: 
torch.Size([2, 2, 50257])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]],

        [[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='cuda:0',
       grad_fn=<SubBackward0>)

Errors may occur after changing the batchsize and update interval of the agent

I followed the example:
https://voidful.dev/jupyter/2021/07/25/textrl-elon-musk.html
I wondered why the batch size is larger than update_interval, so I modified it as follows:
before:
agent = actor.agent_ppo(update_interval=10, minibatch_size=2000, epochs=20)
after:
agent = actor.agent_ppo(update_interval=100, minibatch_size=10, epochs=20)
Then, the following error occurs:

ValueError: Expected parameter loc (Tensor of shape (10, 50257)) of distribution Normal(loc: torch.Size([10, 50257]), scale: torch.Size([10, 50257])) to satisfy the constraint Real(), but found invalid values: tensor([[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]], device='cuda:0', grad_fn=<AddmmBackward0>)

I get an error when I use the elon example

Traceback (most recent call last):
  File "/data/TextRL/train2.py", line 46, in <module>
    pfrl.experiments.train_agent_with_evaluation(
  File "/data/TextRL/env/lib/python3.8/site-packages/pfrl/experiments/train_agent.py", line 208, in train_agent_with_evaluation
    eval_stats_history = train_agent(
  File "/data/TextRL/env/lib/python3.8/site-packages/pfrl/experiments/train_agent.py", line 57, in train_agent
    action = agent.act(obs)
  File "/data/TextRL/env/lib/python3.8/site-packages/pfrl/agent.py", line 161, in act
    return self.batch_act([obs])[0]
  File "/data/TextRL/textrl/actor.py", line 216, in batch_act
    return self._batch_act_train(batch_obs)
  File "/data/TextRL/env/lib/python3.8/site-packages/pfrl/agents/ppo.py", line 735, in _batch_act_train
    action_distrib, batch_value = self.model(b_state)
  File "/data/TextRL/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/TextRL/env/lib/python3.8/site-packages/pfrl/nn/branched.py", line 30, in forward
    return tuple(mod(*args, **kwargs) for mod in self.child_modules)
  File "/data/TextRL/env/lib/python3.8/site-packages/pfrl/nn/branched.py", line 30, in <genexpr>
    return tuple(mod(*args, **kwargs) for mod in self.child_modules)
  File "/data/TextRL/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/TextRL/env/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/data/TextRL/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/TextRL/env/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/TextRL/env/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type Float but found BFloat16

Does the package support automatic multi-gpu?

Hi,

Very excited to find this package useful!

Does it support multi-GPU? I am dealing with datasets with long inputs (>> 1024 tokens), and a CUDA error is reported when initializing the environment (env = MyEnv(...)).

Everything works well if I switch to a smaller input size.

Thank you!

AttributeError: module 'numpy' has no attribute '_no_nep50_warning'

https://colab.research.google.com/drive/1bXTOz1yet03xwAHeriV4pbZjR_6vDXTR?usp=sharing

idk what I'm doing wrong

AttributeError                            Traceback (most recent call last)
<ipython-input-7-4461cfac8857> in <module>
      1 import pfrl
      2 from textrl import TextRLEnv, TextRLActor
----> 3 from transformers import BloomTokenizerFast
      4 from petals import DistributedBloomForCausalLM
      5 

29 frames
/usr/local/lib/python3.8/dist-packages/numpy/__init__.py in __getattr__(attr)
    311             x = ones(2, dtype=float32)
    312             if not abs(x.dot(x) - float32(2.0)) < 1e-5:
--> 313                 raise AssertionError()
    314         except AssertionError:
    315             msg = ("The current Numpy installation ({!r}) fails to "

AttributeError: module 'numpy' has no attribute '_no_nep50_warning'
