
Comments (3)

voidful avatar voidful commented on September 1, 2024

Update Interval: This determines how often the RL agent updates its policy based on collected experiences. A smaller update interval means the agent learns more frequently from recent experiences, while a larger interval allows more experiences to accumulate before learning. In the example above, the update interval is set to 10.
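As a rough illustration of that trade-off, here is a toy loop that records when a policy update would fire. It does not use TextRL at all: `update_steps` and the experience buffer are hypothetical names, and it assumes the interval is counted in environment steps.

```python
def update_steps(total_steps, update_interval):
    """Return the step numbers at which a policy update would be triggered.

    Toy model only: experiences accumulate in a buffer, and an update
    fires (and the buffer is emptied) once it holds `update_interval` items.
    """
    buffer = []
    fired_at = []
    for step in range(1, total_steps + 1):
        buffer.append(step)  # stand-in for one collected experience
        if len(buffer) >= update_interval:
            fired_at.append(step)  # the agent would learn here
            buffer.clear()
    return fired_at


if __name__ == "__main__":
    # Smaller interval: more frequent updates, each from fewer experiences.
    print(update_steps(35, 10))
    print(update_steps(35, 5))
```

With an interval of 10, 35 steps give three updates and leave five experiences waiting in the buffer; halving the interval roughly doubles the number of updates.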

from textrl.

debjitpaul avatar debjitpaul commented on September 1, 2024

Thanks for your prompt reply. If I have understood it correctly, maybe what I was after is the "env_max_length". I am looking for a hyperparameter that can help me control when to give a reward, either on sentence level or phrase level.

In the reward module:
If finish == True, the sentence has ended and the reward is calculated for the whole generated sentence.
If we set env_max_length == 4, does this mean the reward is calculated after four tokens have been generated?

Please feel free to correct me if I am not making sense.
Thank you again!


voidful avatar voidful commented on September 1, 2024

You're on the right track, but there might be a slight misunderstanding. The env_max_length is not directly related to the reward calculation. It is the maximum length of the environment's output, which is the response generated by the model. This parameter sets an upper bound on the number of tokens generated in the response.

To achieve your goal of calculating the reward at the end of a sentence or after a specific number of tokens, you need to modify the reward module itself rather than the env_max_length parameter.

You can implement the following approach:

  1. Monitor the generated tokens and identify the end of a sentence (e.g., by detecting punctuation such as ".", "!", or "?").
  2. Once you detect the end of a sentence or reach a predefined number of tokens (e.g., 4 tokens in your example), calculate the reward for the generated tokens.

By implementing this method, you can control when to calculate and assign a reward based on the generated sentence or phrase. Remember that the reward module should be designed to work well with your specific task and the quality of the generated text you want to achieve.
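The two steps above could be sketched roughly as follows. This is a standalone sketch, not TextRL's API: `should_score`, `phrase_reward`, the `scorer` callback, and the flat list of token strings are all assumptions made for illustration, and the logic would need adapting to whatever arguments your reward function actually receives.

```python
SENTENCE_END = (".", "!", "?")


def should_score(tokens, finish, every_n=4):
    """Decide whether a reward should be computed for the tokens so far.

    Hypothetical helper: scores when generation has finished, when the last
    token ends a sentence, or when a multiple of `every_n` tokens is reached.
    """
    if finish:
        return True
    if not tokens:
        return False
    if tokens[-1].strip().endswith(SENTENCE_END):
        return True
    return len(tokens) % every_n == 0


def phrase_reward(tokens, finish, scorer, every_n=4):
    """Return scorer(tokens) at a scoring point, otherwise 0.0."""
    if should_score(tokens, finish, every_n):
        return float(scorer(tokens))
    return 0.0


if __name__ == "__main__":
    # Placeholder scorer (an assumption): rewards longer outputs.
    length_scorer = len
    print(phrase_reward(["The", "cat"], False, length_scorer))
    print(phrase_reward(["The", "cat", "sat", "down"], False, length_scorer))
    print(phrase_reward(["Yes", "."], False, length_scorer))
```

Returning 0.0 between scoring points keeps the reward sparse, so the agent is only credited at sentence or phrase boundaries. The real scoring function should replace the placeholder scorer.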

