carperai / trlx
A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
License: MIT License
When running the script, it crashes with a SQL error:
Traceback (most recent call last):
  File "examples/simulacra.py", line 9, in <module>
    conn = sqlite3.connect("data/sac_public_2022_06_29.sqlite")
sqlite3.OperationalError: unable to open database file
This is on a fresh install of trlX on StabilityAI's cluster using standard configuration files.
Alpha v0.2
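SQLite raises "unable to open database file" when the path cannot be opened, for instance when the script is launched from a directory where the relative path data/sac_public_2022_06_29.sqlite does not resolve. A minimal sketch of a clearer failure mode, assuming the database is expected to exist before the script runs; `open_existing_db` is a hypothetical helper, not part of trlX:

```python
import sqlite3
from pathlib import Path


def open_existing_db(path: str) -> sqlite3.Connection:
    """Open an SQLite file, failing early with a readable message if it is missing."""
    db_path = Path(path).expanduser().resolve()
    if not db_path.is_file():
        raise FileNotFoundError(
            f"Expected SQLite database at {db_path}; "
            "download it first or run the script from the directory that contains it."
        )
    return sqlite3.connect(str(db_path))
```

Calling `open_existing_db("data/sac_public_2022_06_29.sqlite")` in place of the bare `sqlite3.connect` call would report the resolved absolute path, which makes a wrong working directory easy to spot.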
https://arxiv.org/abs/2210.11693
Amos reports better scaling (for multi accelerator) and better performance when compared to AdamW for autoregressive and masked language modeling. We should apply it to trlX and see if it helps speed up RLHF.
We could alternatively just stay with AdamW, which is very tried and tested.
We need to seriously consider the wall time constraints of Amos and if it creates any serious optimization bottlenecks for us.
LoRA and other parameter-efficient methods can provide a number of advantages when finetuning large language models. These methods typically update only a small fraction of the model parameters during finetuning. For example, LoRA trains only low-rank reparameterizations of weight matrices, reducing the number of trainable parameters by up to 10,000 times.
Key Advantages:
The OpenDelta library provides support for LoRA and other "delta methods":
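from transformers import AutoModelForCausalLM
from opendelta import LoraModel

model = AutoModelForCausalLM.from_pretrained(model_base)
# attach LoRA modules to the selected layers of the backbone
delta_model = LoraModel(backbone_model=model, modified_modules=['fc2'])
# freeze everything except the delta parameters and the embedding layer norm
delta_model.freeze_module(exclude=["deltas", "layernorm_embedding"], set_state_dict=True)
# save only trained parameters
delta_model.save_finetuned(save_path)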
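To make the parameter saving concrete, here is a small sketch of the arithmetic for a single weight matrix. It assumes the usual LoRA formulation, in which a frozen matrix W of shape d_out x d_in is augmented by a trainable product B @ A with B of shape d_out x rank and A of shape rank x d_in. The function name and the example sizes are illustrative only.

```python
from typing import Tuple


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full finetuning updates every entry of the d_out x d_in matrix W.
    LoRA freezes W and trains only B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(full, lora, full / lora)  # 16777216 65536 256.0
```

The per-matrix saving is smaller than the whole-model figure quoted above, since whole-model counts also include the many matrices that receive no adapter at all.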
Towards replication of ELM Stage 3, I'm looking into adding softprompts to train a conditional learnable embedding with PPO for each terrain mentioned in the paper.
Following https://github.com/kipgparker/soft-prompt-tuning.
Below is an outline of the code, followed by tracebacks for different numbers of soft prompt tokens. I will come back to this, but let me know if you have any suggestions for modifying the orchestrator.
Using the ppo_sentiments example and soft prompt implementation:
if __name__ == "__main__":
    cfg = TRLConfig.load_yaml("configs/ppo_config.yml")
    sentiment_pipe = pipeline(
        "sentiment-analysis", "lvwerra/distilbert-imdb", device=-1
    )

    def reward_fn(samples: List[str]):
        sent_kwargs = {
            "return_all_scores": True,
            "function_to_apply": None,
            "batch_size": cfg.method.chunk_size,
        }
        pipe_outputs = sentiment_pipe(samples, **sent_kwargs)
        scores = torch.tensor([output[1]["score"] for output in pipe_outputs])
        return scores

    model: AcceleratePPOModel = get_model(cfg.model.model_type)(cfg)
    # setup soft prompt embeddings with 'n' prefix tokens, init from model vocab
    n_tokens = 1
    initialize_from_vocab = True
    s_wte = SoftEmbedding(model.model.gpt.get_input_embeddings(),
                          n_tokens=n_tokens,
                          initialize_from_vocab=initialize_from_vocab)
    model.model.gpt.set_input_embeddings(s_wte)
    pipeline: PPOPipeline = get_pipeline(cfg.train.pipeline)(model.tokenizer, cfg)
    orch: PPOOrchestrator = get_orchestrator(cfg.train.orchestrator)(
        model, pipeline, reward_fn=reward_fn, chunk_size=cfg.method.chunk_size
    )
    orch.make_experience(cfg.method.num_rollouts)
    model.learn()
    print("DONE!")
When n_tokens = 1, the following error occurs:
Traceback (most recent call last):
  File "/home/aleph/adai/trlx/examples/ppo_sentiments.py", line 101, in <module>
    orch.make_experience(cfg.method.num_rollouts)
  File "/home/aleph/adai/trlx/trlx/orchestrator/ppo_orchestrator.py", line 69, in make_experience
    logits, _, v = self.rl_model.model(all_tokens)
ValueError: not enough values to unpack (expected 3, got 2)
For n_tokens = 20:
Traceback (most recent call last):
  File "/home/aleph/adai/trlx/examples/ppo_sentiments.py", line 101, in <module>
    orch.make_experience(cfg.method.num_rollouts)
  File "/home/aleph/adai/trlx/trlx/orchestrator/ppo_orchestrator.py", line 62, in make_experience
    query_tensors, response_tensors, response_text = self.rl_model.act(batch)
  File "/home/aleph/adai/trlx/trlx/model/accelerate_base_model.py", line 104, in act
    _ = self.model(
  File "/home/aleph/anaconda3/envs/trlx/lib/python3.8/site-packages/torch-1.12.1-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aleph/adai/trlx/trlx/model/nn/ppo_models.py", line 76, in forward
    transformer_outputs = self.gpt.transformer(
  File "/home/aleph/anaconda3/envs/trlx/lib/python3.8/site-packages/torch-1.12.1-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aleph/anaconda3/envs/trlx/lib/python3.8/site-packages/transformers-4.22.2-py3.8.egg/transformers/models/gpt2/modeling_gpt2.py", line 851, in forward
    hidden_states = inputs_embeds + position_embeds
RuntimeError: The size of tensor a (20) must match the size of tensor b (4) at non-singleton dimension 1
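The n_tokens = 20 failure looks consistent with how the linked soft-prompt-tuning code appears to work: its embedding wrapper seems to discard the first n_tokens positions of the input ids and put the learned prompt in their place, so callers are expected to left-pad input_ids (and the attention mask) by n_tokens. The sketch below reproduces only that length arithmetic in plain Python, under this assumption. Both helper names are hypothetical, and the sketch does not address the separate unpacking error seen with n_tokens = 1.

```python
def soft_embedding_length(seq_len: int, n_tokens: int) -> int:
    """Length of the embedded sequence, assuming the wrapper computes
    cat([learned_prompt, wte(tokens[:, n_tokens:])], dim=1)."""
    kept = max(seq_len - n_tokens, 0)  # slicing past the end leaves nothing
    return n_tokens + kept


def pad_for_soft_prompt(input_ids: list, n_tokens: int, pad_id: int = 0) -> list:
    """Prepend n_tokens placeholder ids that the soft prompt will overwrite."""
    return [pad_id] * n_tokens + list(input_ids)


prompt = [101, 102, 103, 104]  # 4 real tokens, as in the traceback above
print(soft_embedding_length(len(prompt), 20))  # 20, while position embeddings cover 4
padded = pad_for_soft_prompt(prompt, 20)
print(soft_embedding_length(len(padded), 20))  # 24, equal to len(padded)
```

If this reading is right, the orchestrator would need to pad queries before they reach the model, and strip the placeholder positions again when decoding.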
I've run ppo_sentiments.py on the current version and on an older one, and I'm seeing that the ratio is != 1 at step 0 (before any optimizer step) at this line:
https://github.com/CarperAI/trlx/blob/main/trlx/model/nn/ppo_models.py#L165
Reference regarding ratio = 1 at the first epoch/mini-batch update: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Check if ratio=1: Check if the ratio are always 1s during the first epoch and first mini-batch update, when new and old policies are the same and therefore the ratio are 1s and has nothing to clip. If ratio are not 1s, it means there is a bug and the program has not reconstructed the probability distributions used in rollouts.
When making experience, the ratio you'd get here at the start of training (before optimization) is 1: https://github.com/CarperAI/trlx/blob/main/trlx/orchestrator/ppo_orchestrator.py#L130
Seems like a currently unknown cause/bug, which leads to unexpected ratio values.
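For reference, the check described in the blog post can be written down in a few lines. This is a plain-Python sketch of the PPO probability ratio computed from log-probabilities, not the code at the linked line; the numbers are made up.

```python
import math


def ppo_ratios(new_logprobs, old_logprobs):
    """Per-token ratios pi_new(a|s) / pi_old(a|s), computed in log space."""
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]


old = [-1.2, -0.3, -2.5]  # log-probs stored during the rollout (made-up values)

# Before any optimizer step the policy is unchanged, so recomputing the
# log-probs for the same tokens should reproduce the stored values exactly.
assert all(abs(r - 1.0) < 1e-6 for r in ppo_ratios(list(old), old))

# Any discrepancy in how the log-probs are recomputed shows up as ratios != 1.
print(ppo_ratios([-1.0, -0.3, -2.5], old))
```

A useful debugging step is therefore to compare the stored and recomputed log-probabilities token by token on the very first mini-batch and see where they start to differ.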
0.3.0
Python 3.8
Currently ILQL was implemented more or less in a vacuum of our PPO implementation. As such, ILQL has features that our PPO implementation needs. This includes
We should use the RL4LMs benchmark suite; I think it is a strong candidate to show the strengths and weaknesses of TRLX.
I want to finetune a base model M to maximize a reward R when the model is used inside of a more complex system.
Take a simple example of the setting. The trajectory is as follows: sample prompt_1 from a dataset of prompts, then
prompt_1 -> M(prompt_1) = out_1
out_1 -> F(out_1) = prompt_2
prompt_2 -> M(prompt_2) = out_2
out_2 -> R(out_2) = reward
where F : str -> str and R : str -> int are some methods defined in my code.
Is there a way to do this in the current TRLX framework, preferably online with PPO?
Alternative suggestions are welcome.
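One way to approximate this setting is to fold the second half of the trajectory into the reward function, so that the reward assigned to out_1 is R applied to whatever the rest of the system produces from it. The sketch below assumes the trainer accepts a `reward_fn` that maps a list of strings to a list of scores. `generate` is a hypothetical callable standing in for M (for example, a frozen copy of the model), and `make_two_step_reward` is not a trlx function.

```python
from typing import Callable, List


def make_two_step_reward(
    generate: Callable[[List[str]], List[str]],
    F: Callable[[str], str],
    R: Callable[[str], int],
) -> Callable[[List[str]], List[float]]:
    """Build a reward_fn that scores out_1 by completing the trajectory."""

    def reward_fn(samples: List[str]) -> List[float]:
        prompts_2 = [F(out_1) for out_1 in samples]    # out_1 -> prompt_2
        outs_2 = generate(prompts_2)                   # prompt_2 -> out_2
        return [float(R(out_2)) for out_2 in outs_2]   # out_2 -> reward

    return reward_fn


reward_fn = make_two_step_reward(
    generate=lambda prompts: [p.upper() for p in prompts],  # toy stand-in for M
    F=lambda s: s + " again",
    R=len,
)
print(reward_fn(["hi"]))  # [8.0]
```

Under this scheme only the first call to M is optimized directly; the second call is treated as part of the environment. Whether the samples passed to `reward_fn` include the prompt as well as the completion would also need checking before F is applied.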
Could you please give a conceptual explanation of the hydra models? Very interested in how they work! Thank you!
We want an example that prompt engineers a language model to be the critic. Jasper offered during TRLX weekly. Creating this issue for reference.
On both v0.3 and https://github.com/CarperAI/trlx/commit/ff0d0776ce9189c7e0ebc954dd14bbca1136a450, following the instructions from README.md and running
wandb disable && python examples/randomwalks.py
produces the following error:
Traceback (most recent call last):
  File "/home/dpaleka/code/trlx/examples/randomwalks.py", line 103, in <module>
    trlx.train(
  File "/home/dpaleka/code/trlx/trlx/trlx.py", line 95, in train
    model.learn()
  File "/home/dpaleka/code/trlx/trlx/model/accelerate_base_model.py", line 240, in learn
    results = self.evaluate()
  File "/home/dpaleka/code/trlx/trlx/model/accelerate_base_model.py", line 160, in evaluate
    samples = self.generate(prompts)
  File "/home/dpaleka/code/trlx/trlx/model/accelerate_base_model.py", line 133, in generate
    return self.accelerator.unwrap_model(self.model).generate(
  File "/home/dpaleka/code/trlx/trlx/model/nn/ilql_models.py", line 306, in generate
    logits[torch.where(logit_mask[input_ids[:, -1].squeeze()])] = -np.inf
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
trlx==0.3
Measure gradient norms and gradient noise of RL training. This is to address open questions around scaling gradients when mixing RL and non-RL tasks, as well as informing optimal batch sizes
The 37 Implementation Details of PPO, a blog post published at ICLR, details a number of PPO implementation details to improve both efficiency and model performance. See also: Andrychowicz et al., Engstrom et al.
Some of these optimizations are minor and probably irrelevant, many are already implemented here, and some may provide performance boosts to trlx. This issue documents these details as a checklist, to track the progress of this repository towards the entire list.
[...] trlx already does this.
[...] sqrt(2) and bias of 0, with policy network last layer scaled by 0.01 after init.
[...] 1e-7 as Adam epsilon (and actually find that the PyTorch default of 1e-8 is the worst of the choices tested).
[...] weight_decay: 1e-6 at all? It also uses Cosine Annealing instead of Linear, and decays not to 0 (recommended by Andrychowicz et al.) but to 1.412e-4 by default. Maybe test linear to see if it makes a difference?
[...] trlx.
[...] trlx this is being done in make_experience.
[...] whiten is called at mini-batch level?
[...] trlx.
[...] trlx.
[...] trlx. OAI set it to 0 for mujoco anyway, and Andrychowicz et al. find that regularization does not help performance, so this may not be useful to implement.
[...] trlx grad_clip config option does not appear to be connected to anything. Andrychowicz et al. find a small performance boost from ensuring the norm of gradients of all parameters does not exceed 0.5.
[...] trlx due to the hydra heads implementation.
Other items in the blog post are environment/network specific to problems trlx does not tackle. Andrychowicz also contains other hyperparameter choices not mentioned here which may be of interest.
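As an illustration of the global gradient clipping item above (keeping the norm of the gradients of all parameters at or below 0.5), here is a plain-Python sketch of the arithmetic. It is not trlx code; in PyTorch the same operation is provided by `torch.nn.utils.clip_grad_norm_`.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[List[float]], max_norm: float = 0.5) -> List[List[float]]:
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total_norm + 1e-6))  # never scale gradients up
    return [[g * scale for g in grad] for grad in grads]


clipped = clip_by_global_norm([[3.0], [4.0]])  # global norm is 5.0 before clipping
print(math.sqrt(sum(g * g for grad in clipped for g in grad)))  # just under 0.5
```

Because a single scale factor is applied to every parameter, the direction of the update is preserved and only its magnitude is limited.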
The trlx repo currently supports decoder-only models such as GPT-2 and GPT-J. It would be beneficial to implement an encoder-decoder model for RLHF finetuning.
An implementation that supports encoder-decoder models would be useful, as they are widely used in many downstream NLP tasks.
It would be good to have T5 as the base model for the encoder-decoder architecture.
We need a way to sweep over a set of hyperparameters. Maybe something like wandb sweep? @reciprocated mentioned this is very important; however, @ShivanshuPurohit has mentioned wandb sweep does not work with NeoX, so I am hesitant.
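Independently of which sweep tool is chosen, the core of a grid sweep is just enumerating combinations of values. A minimal, tool-agnostic sketch follows; the hyperparameter names and values are made up for illustration.

```python
import itertools
from typing import Dict, Iterator, List


def grid(search_space: Dict[str, List]) -> Iterator[Dict]:
    """Yield one config dict per combination of hyperparameter values."""
    keys = list(search_space)
    for values in itertools.product(*(search_space[k] for k in keys)):
        yield dict(zip(keys, values))


space = {"lr": [1e-5, 1e-4], "batch_size": [16, 32]}  # hypothetical search space
configs = list(grid(space))
print(len(configs))  # 4
```

Each yielded dict could then be merged into the base config and handed to a separate training run, whatever launcher ends up being used.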
Since the reward function was moved outside of the orchestrator, the example in the README is no longer correct and will no longer run. I will update it.
When I ran accelerate launch examples/ppo_sentiments.py, the error below happened. Am I supposed to unwrap the DDP model?
AttributeError: 'DistributedDataParallel' object has no attribute 'generate'
Traceback (most recent call last):

/home/user/bob_workspace/code/trlx/examples/ppo_sentiments.py:38 in <module>
  35     orch: PPOOrchestrator = get_orchestrator(cfg.train.orchestrator)(
  36         model, pipeline, reward_fn=reward_fn, chunk_size=cfg.method.chunk_size
  37     )
❱ 38     orch.make_experience(cfg.method.num_rollouts)
  39     model.learn()
  40
  41     print("DONE!")

/home/user/bob_workspace/code/trlx/trlx/orchestrator/ppo_orchestrator.py:64 in
  63
❱ 64             query_tensors, response_tensors, response_text = self.rl_model.act(batc
  65             texts = [q + r for q, r in zip(batch.text, response_text)]
  66             scores = self.score(texts)
  67

/home/user/bob_workspace/code/trlx/trlx/model/accelerate_base_model.py:121 in act
  118                 self.dummy_input.to(self.accelerator.device)
  119             )  # Dummy pass to make things play nice with accelerate
  120             # Removed synced gpus
❱ 121             response = self.model.generate(
  122                 query_tensors,
  123                 pad_token_id=self.tokenizer.eos_token_id,
  124                 **self.config.method.gen_kwargs

/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py:1185 in __getattr__
  1182             modules = self.__dict__['_modules']
  1183             if name in modules:
  1184                 return modules[name]
❱ 1185         raise AttributeError("'{}' object has no attribute '{}'".format(
  1186             type(self).__name__, name))
  1187
  1188     def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None:
My accelerate config
- `Accelerate` version: 0.13.2
- Platform: Linux-5.4.0-107-generic-x86_64-with-glibc2.31
- Python version: 3.9.5
- Numpy version: 1.23.4
- PyTorch version (GPU?): 1.11.0 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- main_process_ip: None
- main_process_port: None
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- downcast_bf16: no
trlx==1.0.0
Create an example showing reward modeling. This could use a synthetic reward source artificially limited, or the HHH Anthropic data (already on the Stability cluster).
More ideas for tasks: #13 (comment)
(cc @haileyschoelkopf)
If the reward model cannot fit on a single GPU, which will be the case when we are training our instruct GPT model, then the current system fails since you would have to run two accelerate instances at once.
We need support for NeoX, EleutherAI's fork of Megatron-Deepspeed. This is already something in active development, just making a GitHub issue for tracking.
I was running experiments using the ILQL sentiment example code. When using a single A100 GPU, I got an evaluation score of 0.9286 after 1k steps of training. However, when I switched to multi-GPU training (2 A100s), I got a score of 0.692 after 1k steps. I use Hugging Face Accelerate. All hyperparameters are the same. Any idea why this happens?
Multi-GPU training command:
accelerate launch --config_file accelerator_config.yaml examples/ilql_sentiments.py
trlX==0.3.0
pytorch==1.13.0+cu116
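One thing worth ruling out, offered as a guess rather than a diagnosis: under data parallelism the configured batch size normally applies per process, so "the same hyperparameters" on two GPUs means twice as many samples per optimizer step at an unchanged learning rate. The sketch below only shows that arithmetic; the batch size of 128 is a made-up value, not taken from the ILQL config.

```python
def effective_batch_size(per_device_batch: int, num_processes: int, grad_accum_steps: int = 1) -> int:
    """Number of samples contributing to one optimizer step under data parallelism."""
    return per_device_batch * num_processes * grad_accum_steps


single = effective_batch_size(per_device_batch=128, num_processes=1)
multi = effective_batch_size(per_device_batch=128, num_processes=2)
print(single, multi)  # 128 256
```

If the two runs are meant to be comparable step for step, halving the per-device batch size on two GPUs would keep the effective batch size fixed.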
We should try out github.com/nVIDIA/neMo, using NeMo-Megatron. This would involve
Given that NeMo works with vanilla PyTorch modules, we should be able to reuse a lot of code from the current accelerate implementation.
Let's use black, unless someone knows of a better alternative.
How should we enforce autoformatting? I'm leaning towards a pre-commit hook.
We should do this ASAP; otherwise the pre-formatting and post-formatting diffs will be bigger.
There should be a working example of randomwalks.py (random walks shortest path from the Decision Transformer paper) that uses the PPO orchestrator.
Implement additional online RL algorithms
The ppo_gptj.yml config is currently out of date from recent updates.
n_ctx, grad_clip, log_interval, input_size, gen_size, accelerate, accelerate_config_path are not fields of TrainConfig and should be removed.
device is not a field of ModelConfig and should be removed.
Do we want this config or should it be removed as it's mostly a duplicate of ppo_config.yml here with a model path change?
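Stale keys like these can be caught automatically by comparing each section of a YAML config against the fields of the dataclass it is loaded into. A minimal sketch, using a toy dataclass as a stand-in; `ToyTrainConfig` and `unknown_keys` are hypothetical and not part of trlx.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List


@dataclass
class ToyTrainConfig:  # hypothetical stand-in, not trlx's TrainConfig
    total_steps: int
    batch_size: int


def unknown_keys(config_cls, section: Dict[str, Any]) -> List[str]:
    """Return the keys of a config section that are not fields of config_cls."""
    known = {f.name for f in fields(config_cls)}
    return sorted(k for k in section if k not in known)


yaml_section = {"total_steps": 1000, "batch_size": 8, "n_ctx": 512, "grad_clip": 1.0}
print(unknown_keys(ToyTrainConfig, yaml_section))  # ['grad_clip', 'n_ctx']
```

Running a check of this kind over every file in configs/ as a unit test would flag out-of-date configs as soon as a field is renamed or removed.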
trlx==0.2.0
All the current (online RL) trlx examples seem to only involve generating experience once and then using the resulting rollout store to update the weights (for potentially more epochs). How should one go about incrementally generating experience, training on it, generating experience with the updated model, training again...? I thought of just calling orch.make_experience() and model.learn() in a loop multiple times, but that sounds pretty dumb. Is there a better way?
Loosely related, from "Fine-Tuning Language Models from Human Preferences":
If the trained policy π is very different from the zero-shot policy ρ, the reward model will suffer a large distributional shift from training on samples from ρ to evaluation on samples from π. To prevent this, we can collect human data throughout RL fine-tuning, continuously gathering new data by sampling from π and retraining the reward model. As section 3 shows, online data collection was important for summarization but not for the simpler style tasks.
The question can also be thought of as the one cycle in the computational-ish graph in the section at the bottom from trl:
We need documentation. WIP.
Trying to run a training loop with trlx fails with the error "module 'trlx' has no attribute 'train'".
Code being run (from README):
import trlx
# optimize some reward function
model = trlx.train('gpt2', reward_fn=lambda samples: [sample.count('cats') for sample in samples])
# or steer a model with a collection of rated samples
model = trlx.train('EleutherAI/gpt-j-6B', dataset=[('dolphins', 'geese'), (1.0, 100.0)])
# model is a wrapper with some logit preprocessing
model.generate(**tokenizer('Q: Who rules the world? A:', return_tensors='pt'), do_sample=True)
Behavior:
  File "/home/[user]/trlx/main.py", line 4, in <module>
    model = trlx.train('gpt2', reward_fn=lambda samples: [sample.count('cats') for sample in samples])
AttributeError: module 'trlx' has no attribute 'train'
master
OpenSUSE + Python 3.10.7
@reciprocated mentioned that there is a lot of duplicated code between ILQL and PPO. We need to resolve this before we begin to add more RL algorithms.
There is currently no way to use multiple GPUs with Ray Tune; we probably need to wrap ray.train.torch.TorchTrainer for it to work.
It appears this is what PyTorch Lightning does.
trlx==0.3.0
This requires the current repo to pass flake8, which it currently doesn't.
Getting Started Guide for Domain Experts
A step-by-step cookbook that enables people with domain knowledge to rapidly begin contributing (and receiving) value to the open-source RLHF project.
Motivation: I am able to install trlx and am confident that I will be able to get component scripts running. However, I do not have a clear path to doing something that is quickly useful to me and to others.
Goal:
I would like to be able to pick a particular domain that I am expert in (for example, book publishing, military history, or climate change) and provide "human feedback" that visibly makes results better for a) me on a toy setup and b) everyone when deployed.
Suggested approach:
Step-by-step guide with sample artifacts.
I might have taken this to Discord, but I hate Discord. It is noisy and chaotic. Github is much better suited for domain expertise projects.
Glad to help.
Do RLHF using Anthropic's HH data, using existing models.
Depends on #25
(cc @haileyschoelkopf)
Self play, and generally multi-LM-agent settings are something we are very interested in exploring. What does it take to support this? Does it already work without big overheads?
We need the ability to use massive reward models, as this will be necessary for our Instruct GPT model. Currently the size of the reward model is greatly limited and using GPU accelerators for them comes with weird sets of limitations.
We could alternatively use a different accelerate script for the reward model, or include the reward model within the student class. Doing the latter would be trivial but result in kind of gross code and not very easily extendible.
Hydra model doesn't play nicely with ddp
File "examples/ppo_sentiments.py", line 18, in <module>
model = trlx.train(
File "/trlx/trlx/trlx.py", line 92, in train
model.learn()
File "/trlx/trlx/model/accelerate_base_model.py", line 209, in learn
loss, stats = self.loss(batch)
File "/trlx/trlx/model/accelerate_ppo_model.py", line 112, in loss
logits, _, vpred = self.model(
File "/trlx/.env/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/trlx/.env/lib64/python3.8/site-packages/torch/nn/parallel/distributed.py", line 994, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by
passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the
return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 31 32 33
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false
accelerate launch --num_processes 2 --num_machines 1 --config_file ddp.yaml examples/ppo_sentiments.py
This is a relevant discussion
pytorch/pytorch#43259
stage-api @ 8057d16
Add jax support for RLHF on TPUs.
Hey all, I am using this colab notebook as a reference (that I found in the discord server) to train examples/ppo_sentiments.py using HF Accelerate in a GCP VM with 2 K80s.
This is my accelerate config:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: '[all]'
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
Unfortunately, nothing happens after this point, as shown in the image:
To avoid getting this warning, huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks..., I set TOKENIZERS_PARALLELISM=false.
I am not an expert in using Accelerate; any help would be appreciated. Also, for context, I am trying to run this example to build W&B sweeps (hyperparameter optimization), mentioned in #12.
cc: @LouisCastricato
Installing trlx in a fresh environment, following the README.md guide (python setup.py develop), results in the following package error:
error: Multiple top-level packages discovered in a flat-layout: ['trlx', 'configs', 'unittests'].
To avoid accidental inclusion of unwanted files or directories,
setuptools will not proceed with this build.
If you are trying to create a single distribution with multiple packages
on purpose, you should not rely on automatic discovery.
Instead, consider the following options:
1. set up custom discovery (`find` directive with `include` or `exclude`)
2. use a `src-layout`
3. explicitly set `py_modules` or `packages` with a list of names
To find more information, look for "package discovery" on setuptools docs.
Python version : 3.10.6
Compiler : Clang 13.1.6 (clang-1316.0.21.2.5)
OS : Darwin
Release : 21.5.0
Machine : arm64
Processor : arm
In Colab, running !pip install git+https://github.com/CarperAI/trlx resulted in
ERROR: No matching distribution found for numpy>=1.23.2
NumPy dropped support for Python 3.7 in recent releases, and Python 3.7 is what Google Colab is using right now. One might consider downgrading numpy. Per discussions in Discord, I am putting this here so that we don't forget.
numpy==1.23.2
If we had links to benchmarks for the example (and/or test) models, it would be easier to add new models, and keep track of improvements in method implementations. Additionally, during refactoring, it would allow checking that no performance degrading changes were introduced.
This can be a minimal version of #13
Hey there! I've done a first review of the documentation.
For context, I skimmed the documentation as a beginner to see where the friction points are.
I have solutions for some of them, which is why I opened a PR: #64
For the others, I prefer for now to open an issue to discuss with you how we can improve them.
So here are the points:
For the example page, I was thinking of a README explaining each example (and updating the website documentation).
A simple Colab as a quick start would be interesting to have. For instance, for the SB3 integration we created a small one where in 10 minutes you train your first agent and load a PPO agent from the Hub to play Space Invaders. Having that kind of quick-run Colab can help people rapidly get the big picture of the library.
We are missing a BibTeX entry for citing.
WDYT?
PR: #64
This is related to #69 (which is why I phrased it in a similar way), but still feels a bit different.
Let's say the model generates a sequence of three related sentences (or paragraphs or tokens) after being prompted (i.e. the rollout). Is there a way to assign them different rewards individually instead of just one single aggregate reward, say based on different criteria? Perhaps I have a constant mass of reward I want to differentially assign to the several parts, but the sum is always constant. In the limit of generality, this would mean being able to assign specific reward values for each individual token/action in the rollout/trajectory. In this use case, the individual rewards can only be computed after the whole sequence of parts has been generated (i.e. you can't reward step 1 before generating step 3).
Is this possible with trlx? Would it require a custom orchestrator or is there a way to specify individual token rewards right away while keeping the standard structure? Is this even possible with PPO in the first place, or is there a fundamental misunderstanding on my part?
Thanks for building this!
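To make the data shape in this question concrete: PPO as an algorithm works with one reward per step, so a per-token reward vector is a natural representation. The sketch below splits a fixed reward mass across segments and places each share on the last token of its segment. It says nothing about whether trlx currently accepts such a vector; `spread_rewards` is a hypothetical helper.

```python
from typing import List


def spread_rewards(segment_lengths: List[int], weights: List[float], total_reward: float = 1.0) -> List[float]:
    """Split a fixed reward mass across segments, putting each share on the
    last token of its segment and 0 on every other token."""
    if len(segment_lengths) != len(weights):
        raise ValueError("need exactly one weight per segment")
    norm = sum(weights)
    rewards: List[float] = []
    for length, weight in zip(segment_lengths, weights):
        share = total_reward * weight / norm
        rewards.extend([0.0] * (length - 1) + [share])
    return rewards


# Three sentences of 3, 2 and 4 tokens; the third gets half of the reward mass.
per_token = spread_rewards(segment_lengths=[3, 2, 4], weights=[1, 1, 2])
print(per_token)  # [0.0, 0.0, 0.25, 0.0, 0.25, 0.0, 0.0, 0.0, 0.5]
```

Since the shares are computed only after the whole sequence exists, this matches the constraint that step 1 cannot be rewarded before step 3 has been generated.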
We should also add it as a pre-merge hook
There is no information on what config or machines this was tested on, nor what the results actually were. I was unable to get my configuration to work for the example code, but I might be using an untested deepspeed configuration (e.g., stage 3 offloading). I'd like to test with the validated configuration.
Could you add the tested configurations and machines? Thanks!
A text-to-image RLHF pipeline and orchestrator is needed.