I want to make a Furuta pendulum.

Would this library be usable for real-time deep learning on a Furuta pendulum?

DaxLynch commented on June 14, 2024

Would this library be usable for real-time deep learning on a Furuta pendulum?

Comments (2)

DaxLynch commented on June 14, 2024

So I decided to use this library, and I am currently writing my own vanilla policy gradient. The TD3 algorithm seems to assume that the state of the environment doesn't change on its own as a result of an observation/action. In my project, however, the plan is to make an observation, take an action, and receive a reward approximately one fixed eval period later. I therefore figured I couldn't have the two critics making observations and getting rewards, because the state would have changed in the meantime.

jonas-eschmann commented on June 14, 2024

Hi @DaxLynch, sorry for the late reply. That sounds like a very cool project! Since TD3 is an off-policy method, you can populate the replay buffer however you like. I would suggest creating your own training loop by copying the operations from here:

rl-tools/include/rl_tools/rl/algorithms/td3/loop/core/operations_generic.h

Line 71 in 33e9fff

 bool step(DEVICE& device, rl::algorithms::td3::loop::core::State<T_CONFIG>& ts){ 

However, remove the call step(device, ts.off_policy_runner, ts.actor_critic.actor, ts.actor_buffers_eval, ts.rng); this is where the off-policy runner is used to generate samples from the environment and place them into the replay buffer.
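As background for why the runner step can be dropped: in the standard TD3 formulation, the critics are trained only on stored transitions and never query the environment themselves. The target below is the usual TD3 target written in commonly used notation (the symbols are assumptions of this note, not names taken from rl-tools): r is the reward, d the terminated flag, s' the next state, and the primed parameters belong to the target networks.

```latex
y = r + \gamma \,(1 - d)\, \min_{i \in \{1, 2\}} Q_{\theta_i'}\bigl(s',\, \tilde{a}\bigr),
\qquad
\tilde{a} = \operatorname{clip}\bigl(\pi_{\phi'}(s') + \operatorname{clip}(\epsilon, -c, c),\; a_{\min},\; a_{\max}\bigr),
\quad \epsilon \sim \mathcal{N}(0, \sigma^2)
```

Everything on the right-hand side comes from the replay buffer or from the networks, so a reward that arrives one control period after the action poses no problem as long as it is stored in the same tuple as that action.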

Then you can have your own code use the policy to generate actions (as, e.g., in the evaluation code:

rl-tools/include/rl_tools/rl/utils/evaluation.h

Line 64 in 33e9fff

 evaluate(device, policy, observation_normalized, action_full, policy_eval_buffers); 

). Once you have collected the [state, observation, action, reward, next_state, next_observation, terminated_flag, truncated_flag] tuple you can add it to the replay buffer like in

rl-tools/include/rl_tools/rl/components/off_policy_runner/operations_generic_per_env.h

Line 99 in 33e9fff

 add(device, runner.replay_buffers[env_i], state, observation, observation_privileged, action, reward_value, next_state, next_observation, next_observation_privileged, terminated_flag, truncated); 
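To make the shape of such a tuple concrete, here is a minimal, library-independent sketch of a transition record and a fixed-capacity replay buffer. The names Transition, ReplayBuffer, OBS_DIM and ACT_DIM are hypothetical and are not rl-tools types; the observation layout is an assumption, and the separate state and privileged-observation fields from the call above are omitted for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT rl-tools types.
constexpr std::size_t OBS_DIM = 4;  // assumed: two angles and two angular velocities
constexpr std::size_t ACT_DIM = 1;  // assumed: a single motor command

struct Transition {
    std::array<float, OBS_DIM> observation;
    std::array<float, ACT_DIM> action;
    float reward;  // measured one control period after the action
    std::array<float, OBS_DIM> next_observation;
    bool terminated;
    bool truncated;
};

// Fixed-capacity ring buffer: once full, the oldest transition is overwritten.
class ReplayBuffer {
public:
    explicit ReplayBuffer(std::size_t capacity) : data_(capacity) {
        assert(capacity > 0);
    }

    void add(const Transition& t) {
        data_[position_] = t;
        position_ = (position_ + 1) % data_.size();
        if (size_ < data_.size()) {
            ++size_;
        }
    }

    std::size_t size() const { return size_; }
    bool full() const { return size_ == data_.size(); }

    // Index 0 refers to the oldest transition still stored.
    const Transition& at(std::size_t i) const {
        assert(i < size_);
        const std::size_t start = full() ? position_ : 0;
        return data_[(start + i) % data_.size()];
    }

private:
    std::vector<Transition> data_;
    std::size_t position_ = 0;
    std::size_t size_ = 0;
};
```

Because the buffer only stores what it is given, the code that fills it can run on its own schedule, independently of any training step.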

It is probably best to do a full rollout of the trajectory and then, when the robot is idle, do some steps of training, so that the training does not interfere with the control frequency.
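The split between a time-critical rollout and training during idle time could be structured roughly as follows. This is a sketch under stated assumptions, not rl-tools code: Step, collect_rollout and train_when_idle are hypothetical names, observations and actions are reduced to single floats, and the hardware access, policy, and training step are passed in as callbacks.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for one replay-buffer entry; NOT an rl-tools type.
struct Step {
    float observation;
    float action;
    float reward;
    float next_observation;
};

// Time-critical phase: act at a fixed control period and only record data.
std::vector<Step> collect_rollout(
    std::size_t num_steps,
    std::chrono::microseconds control_period,
    const std::function<float()>& read_observation,       // e.g. read the encoders
    const std::function<float(float)>& policy,            // observation -> action
    const std::function<void(float)>& apply_action,       // e.g. set the motor command
    const std::function<float(float, float)>& reward_fn)  // (next_observation, action) -> reward
{
    assert(num_steps > 0);
    std::vector<Step> rollout;
    rollout.reserve(num_steps);
    auto next_tick = std::chrono::steady_clock::now();
    float observation = read_observation();
    for (std::size_t i = 0; i < num_steps; ++i) {
        const float action = policy(observation);
        apply_action(action);
        next_tick += control_period;
        std::this_thread::sleep_until(next_tick);  // wait out the rest of the period
        const float next_observation = read_observation();
        // The reward is computed one control period after the action was applied.
        rollout.push_back(Step{observation, action,
                               reward_fn(next_observation, action),
                               next_observation});
        observation = next_observation;
    }
    return rollout;
}

// Idle phase: append the rollout to the replay data, then run training steps.
void train_when_idle(
    std::vector<Step>& replay,
    const std::vector<Step>& rollout,
    std::size_t num_training_steps,
    const std::function<void(const std::vector<Step>&)>& training_step)
{
    replay.insert(replay.end(), rollout.begin(), rollout.end());
    for (std::size_t i = 0; i < num_training_steps; ++i) {
        training_step(replay);
    }
}
```

Keeping the two phases in separate functions means that no gradient computation runs inside the control loop, so the control period is determined only by the policy evaluation and the hardware I/O.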

I think this is a common use case (where the interaction/inference cannot be in lock-step with, and governed by, the training loop), and I'm thinking of ways to better accommodate it. Once I find some time, I'll probably move the rollouts of the off-policy runner into a separate loop so that they can be optional, and create an example where the rollouts are conducted independently of the training.
