Hi, Firstly, great repo! Thanks! Secondly, one qui

Update interval about textrl HOT 3 CLOSED

voidful commented on September 1, 2024

Update interval

from textrl.

Comments (3)

voidful commented on September 1, 2024

Update Interval: This determines how often the RL agent updates its policy based on collected experiences. A smaller update interval means the agent learns more frequently from recent experiences, while a larger interval allows more experiences to accumulate before learning. In the example above, the update interval is set to 10.
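To make the trade-off concrete, here is a minimal toy sketch of the idea. It is not TextRL's (or its RL backend's) actual implementation: `ToyAgent`, `observe`, and `count_updates` are hypothetical names invented for illustration, and the "update" only increments a counter where a real agent would run gradient steps.

```python
class ToyAgent:
    """Hypothetical agent that learns once every `update_interval` experiences."""

    def __init__(self, update_interval):
        self.update_interval = update_interval
        self.buffer = []        # experiences collected since the last update
        self.num_updates = 0    # how many policy updates have happened

    def observe(self, experience):
        # Store the experience; update only once the buffer is full.
        self.buffer.append(experience)
        if len(self.buffer) >= self.update_interval:
            self._update()

    def _update(self):
        # A real agent would run its policy-gradient steps here.
        self.num_updates += 1
        self.buffer.clear()


def count_updates(update_interval, total_steps):
    """Return how many updates happen over `total_steps` experiences."""
    agent = ToyAgent(update_interval)
    for step in range(total_steps):
        agent.observe(step)
    return agent.num_updates
```

Under these assumptions, 100 collected experiences trigger ten updates with an interval of 10 but only two with an interval of 50, each of the latter based on a larger batch.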


debjitpaul commented on September 1, 2024

Thanks for your prompt reply. If I have understood it correctly, maybe what I was after is the "env_max_length". I am looking for a hyperparameter that can help me control when to give a reward, either on sentence level or phrase level.

In the reward module:
If finish == True, the sentence has ended, and the reward is calculated for the generated sentence.
If we set env_max_length == 4, does this mean that the reward is calculated after four tokens have been generated?

Please feel free to correct me if I am not making sense.
Thank you again!


voidful commented on September 1, 2024

You're on the right track, but there might be a slight misunderstanding. The env_max_length is not directly related to the reward calculation. It is the maximum length of the environment's output, which is the response generated by the model. This parameter sets an upper bound on the number of tokens generated in the response.
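As a rough illustration of "upper bound on generated tokens", the loop below stops either when an end-of-sequence token appears or when the cap is reached. This is a sketch only, not TextRL's environment code: `generate`, `next_token_fn`, and the `"</s>"` end token are assumptions made for the example.

```python
def generate(next_token_fn, env_max_length, eos_token="</s>"):
    """Collect tokens until EOS is produced or `env_max_length` is reached.

    `next_token_fn` is a hypothetical stand-in for the model: it receives
    the tokens generated so far and returns the next token.
    """
    tokens = []
    while len(tokens) < env_max_length:
        token = next_token_fn(tokens)
        if token == eos_token:
            break               # the model ended the response on its own
        tokens.append(token)
    return tokens               # never longer than env_max_length


# Usage with a scripted "model" that replays a fixed sentence.
words = ["the", "cat", "sat", "on", "the", "mat", "."]

def scripted_model(tokens_so_far):
    i = len(tokens_so_far)
    return words[i] if i < len(words) else "</s>"

truncated = generate(scripted_model, env_max_length=4)
complete = generate(scripted_model, env_max_length=20)
```

Note that the cap only truncates the output; nothing in this loop decides when a reward is computed.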

To achieve your goal of calculating the reward at the end of a sentence or after a specific number of tokens, you need to modify the reward module itself rather than the env_max_length parameter.

You can implement the following approach:

1. Monitor the generated tokens and identify the end of a sentence (e.g., by detecting punctuation such as ".", "!", or "?").
2. Once you detect the end of a sentence or reach a predefined number of tokens (e.g., 4 tokens in your example), calculate the reward for the generated tokens.

By implementing this method, you can control when to calculate and assign a reward based on the generated sentence or phrase. Remember that the reward module should be designed to work well with your specific task and the quality of the generated text you want to achieve.
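The approach described above could be sketched roughly as follows. This is a standalone illustration under stated assumptions, not TextRL's reward API: `should_score`, `phrase_level_reward`, `chunk_size`, and `score_fn` are hypothetical names, tokens are assumed to be plain strings, and the scoring function is whatever fits your task.

```python
SENTENCE_END = (".", "!", "?")


def should_score(predicted_tokens, finish, chunk_size=4):
    """Decide whether a reward should be computed at this step."""
    if finish:
        return True                              # generation has ended
    if not predicted_tokens:
        return False                             # nothing generated yet
    if predicted_tokens[-1].endswith(SENTENCE_END):
        return True                              # sentence boundary detected
    return len(predicted_tokens) % chunk_size == 0   # phrase boundary


def phrase_level_reward(predicted_tokens, finish, score_fn, chunk_size=4):
    """Return score_fn(tokens) at sentence/phrase boundaries, else 0.0.

    `score_fn` is a hypothetical task-specific scorer supplied by the caller.
    """
    if should_score(predicted_tokens, finish, chunk_size):
        return float(score_fn(predicted_tokens))
    return 0.0


# Usage with a placeholder scorer that simply counts tokens.
mid_phrase = phrase_level_reward(["a", "b"], False, len)
after_four = phrase_level_reward(["a", "b", "c", "d"], False, len)
after_period = phrase_level_reward(["Hi", "."], False, len)
at_finish = phrase_level_reward(["a"], True, len)
```

In an actual setup, logic like this would live inside your environment's reward method, called once per generation step with the tokens produced so far and the finish flag.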
