quangr / jax-rl

JAX version of the PPO algorithm for MuJoCo environments; achieves SOTA (Tianshou) results

Python 99.47% Shell 0.53%

jax mujoco ppo reinforcement-learning rl envpool

jax-rl's Issues

Add recurrent neural network

I guess there is no standard implementation of an LSTM version of PPO. First we should focus on the training implementation.
CleanRL's implementation:
just save initial_lstm_state, and burn in with the prefix data in the buffer
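
The replay idea above can be sketched roughly as follows. This is not CleanRL's code: a plain tanh recurrent cell stands in for the LSTM, and the names unroll, params, init_h, obs_seq and done_seq are hypothetical. It assumes done_seq[t] marks that obs_seq[t] is the first observation of a new episode, so the hidden state is zeroed before that step.

```python
import jax
import jax.numpy as jnp

def unroll(params, init_h, obs_seq, done_seq):
    """Replay a stored sequence, starting from the hidden state saved at rollout start."""
    w_x, w_h = params  # input-to-hidden and hidden-to-hidden weights (stand-in for an LSTM)

    def step(h, x):
        obs, done = x
        h = h * (1.0 - done)               # reset the state at an episode boundary
        h = jnp.tanh(obs @ w_x + h @ w_h)  # one recurrent step
        return h, h

    final_h, hs = jax.lax.scan(step, init_h, (obs_seq, done_seq))
    return final_h, hs

params = (0.1 * jnp.ones((2, 3)), 0.5 * jnp.eye(3))
obs_seq = jnp.ones((4, 2))
done_seq = jnp.array([0.0, 0.0, 1.0, 0.0])
_, hs_a = unroll(params, jnp.ones(3), obs_seq, done_seq)
_, hs_b = unroll(params, jnp.zeros(3), obs_seq, done_seq)
```

After the reset at step 2 the outputs no longer depend on the saved initial state, which is why the initial state only has to be stored once per rollout.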

The impact of random shuffle

We should write a report on whether random shuffling helps improve performance. Some researchers believe that shuffling the buffer data leads to less covariance between samples, which leads to a better gradient approximation (and helps avoid catastrophic forgetting?).

If we comment out jax.random.permutation(subkey, x) in the HalfCheetah-v3 env, we get:
Nan, Inf or huge value in CTRL at ACTUATOR 0. The simulation is unstable. Time = 1.1500
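
For comparing the two settings, a minimal sketch of minibatch splitting with and without shuffling might look like this. The helper make_minibatches and its shuffle flag are hypothetical and not taken from the repository; only jax.random.permutation is the call mentioned above.

```python
import jax
import jax.numpy as jnp

def make_minibatches(key, batch, num_minibatches, shuffle=True):
    """Split a flat rollout batch into equally sized minibatches."""
    n = batch.shape[0]
    idx = jnp.arange(n)
    if shuffle:
        idx = jax.random.permutation(key, idx)  # random order of sample indices
    size = n // num_minibatches                 # drop the remainder, if any
    idx = idx[: size * num_minibatches].reshape(num_minibatches, size)
    return batch[idx]

key = jax.random.PRNGKey(0)
key, subkey = jax.random.split(key)
data = jnp.arange(12.0)
shuffled = make_minibatches(subkey, data, num_minibatches=3)
ordered = make_minibatches(subkey, data, num_minibatches=3, shuffle=False)
```

With shuffle=False each minibatch is a contiguous slice of the rollout, so consecutive (correlated) transitions end up in the same gradient step.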

reproduce ppo benchmark

I can't find a way to make PPO comparable to the Tianshou benchmark, especially in the HalfCheetah env, where we can't achieve even half of the score.

Benchmark:

Tianshou: Hopper-v3: 2609.3 ± 700.8, HalfCheetah-v3: 5783.9 ± 1244.0

Mine: Hopper-v3: 1683 ± 307, HalfCheetah-v3: 1926 ± 254

What goes wrong?

So far I have tested the following hypotheses:

Is the done step treated correctly?

Result: adding masking in the PPO step and using the value bootstrap does not improve much.
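
As an illustration of what masking the done step means here, below is a minimal GAE sketch. The function name gae, its arguments, and the default gamma and lam values are assumptions, not the repository's code. It assumes dones[t] = 1 when the episode terminates at step t, and bootstraps from last_value at the end of the rollout.

```python
import jax
import jax.numpy as jnp

def gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over a rollout of shape (T,)."""
    last = jnp.asarray(last_value, dtype=values.dtype).reshape(1)
    next_values = jnp.concatenate([values[1:], last])
    not_done = 1.0 - dones

    def step(carry, x):
        r, v, nv, nd = x
        delta = r + gamma * nv * nd - v         # no bootstrap across a terminal step
        adv = delta + gamma * lam * nd * carry  # no advantage leak across episodes
        return adv, adv

    init = jnp.zeros((), dtype=values.dtype)
    _, advantages = jax.lax.scan(
        step, init, (rewards, values, next_values, not_done), reverse=True
    )
    return advantages, advantages + values

adv, ret = gae(
    jnp.array([1.0, 1.0, 1.0]), jnp.zeros(3), jnp.array([0.0, 1.0, 0.0]),
    last_value=10.0, gamma=1.0, lam=1.0,
)
```

In this toy rollout the episode ends at step 1, so the advantage at step 1 ignores everything after it, while step 2 bootstraps from last_value.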

Does the envpool version matter?

Result: changing to a different version doesn't help.

Is the learning rate decay not working right? Tianshou uses 3M steps, so its learning rate only decays to 2/3 of the initial value after 1M steps.

Result: setting the learning rate to a constant or setting the total steps to 3M does not improve much.
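
The 2/3 figure follows from a linear decay to zero over the total step budget, which is assumed here. The function linear_lr and the base learning rate are hypothetical, for illustration only.

```python
def linear_lr(step, total_steps, base_lr=3e-4):
    """Learning rate that decays linearly from base_lr to 0 over total_steps."""
    frac = 1.0 - min(step, total_steps) / total_steps
    return base_lr * frac

# Fraction of the initial learning rate left after 1M steps.
frac_3m_budget = linear_lr(1_000_000, 3_000_000, base_lr=1.0)  # 3M-step schedule
frac_1m_budget = linear_lr(1_000_000, 1_000_000, base_lr=1.0)  # 1M-step schedule
```

Under this assumption a run with a 3M-step schedule is still at two thirds of its initial rate at the 1M mark, while a 1M-step schedule has already reached zero.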

Is the action remap correct?

Result: copied the remap method from Tianshou; still doesn't work.
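
For reference, a common form of this remap clips the policy output to [-1, 1] and rescales it linearly to the environment's action bounds. The sketch below assumes that form; remap_action, low and high are illustrative names, not copied from either code base.

```python
import jax.numpy as jnp

def remap_action(action, low, high):
    """Map a policy output in [-1, 1] to the action range [low, high]."""
    action = jnp.clip(action, -1.0, 1.0)              # bound the raw output
    return low + (high - low) * (action + 1.0) / 2.0  # linear rescale

mapped = remap_action(jnp.array([-1.0, 0.0, 1.0, 2.0]), low=-2.0, high=4.0)
```

For MuJoCo tasks whose bounds are already [-1, 1], the rescale is the identity and only the clipping has an effect.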

Is the gradient step correct?

Result: when using the exact data from Tianshou, the losses produced by both implementations are the same.

Does the rollout phase have a problem? Is the random phase random enough?

Result: don't know how to test this.

It turns out that we need an observation normalizer.
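
A minimal sketch of such a normalizer, assuming a running mean and variance updated batch by batch and a clip range of 10, could look like the following. The names RunningStats, init_stats, update_stats and normalize are hypothetical and not the repository's API.

```python
from typing import NamedTuple
import jax.numpy as jnp

class RunningStats(NamedTuple):
    mean: jnp.ndarray
    var: jnp.ndarray
    count: jnp.ndarray

def init_stats(obs_dim):
    # A small initial count avoids division by zero on the first update.
    return RunningStats(jnp.zeros(obs_dim), jnp.ones(obs_dim), jnp.array(1e-4))

def update_stats(stats, batch):
    """Merge the statistics of a batch of shape (N, obs_dim) into the running ones."""
    b_mean, b_var, b_count = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
    delta = b_mean - stats.mean
    total = stats.count + b_count
    new_mean = stats.mean + delta * b_count / total
    m2 = (stats.var * stats.count + b_var * b_count
          + delta ** 2 * stats.count * b_count / total)
    return RunningStats(new_mean, m2 / total, total)

def normalize(stats, obs, clip=10.0, eps=1e-8):
    """Standardize observations with the running statistics, then clip."""
    return jnp.clip((obs - stats.mean) / jnp.sqrt(stats.var + eps), -clip, clip)

batch = jnp.array([[0.0, 10.0], [2.0, 30.0]])
stats = update_stats(init_stats(2), batch)
normed = normalize(stats, batch)
```

Because the statistics are carried in an immutable tuple, the update can be threaded through a jitted rollout loop as ordinary state.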

Add env wrapper

The problems for now:

Maybe we should make the wrappers' state frozen.
envpool does not seem to account for the handle changing after reset, so for now two handles are returned, one by xla() and one by reset().
