huangjialian / airl_mountaincar
Adversarial Inverse Reinforcement Learning implementation for Mountain Car
Hello, could you list the versions of the packages the code depends on (e.g. the TensorFlow version)? Thanks!
Hi Jia Lian,
I replicated your code in PyTorch, using the same number of hidden units and the position-only observation. My reward is likewise a function of the state only, not the state-action pair, and the reward I use to update the generator is also the discriminator's logistic ratio.
The only differences between our setups should be the number of trajectories collected per iteration, the learning rate, and the regularization. For the discriminator I use your random mini-batch update, but for the generator I sweep through all the (state, action) pairs a few times. There could be minor bugs in my code, but I trust my implementation.
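The "discriminator logistic ratio" mentioned above can be made concrete: since the discriminator is a sigmoid of a learned logit, log D − log(1 − D) algebraically reduces to the logit itself. A minimal NumPy sketch (the function name is mine, not from either implementation):

```python
import numpy as np

def logistic_ratio_reward(logits):
    # With D = sigmoid(logits), log D - log(1 - D) simplifies to the logits.
    d = 1.0 / (1.0 + np.exp(-logits))
    return np.log(d) - np.log(1.0 - d)

logits = np.array([-2.0, 0.0, 2.0])
print(logistic_ratio_reward(logits))  # numerically equal to the logits
```

In practice this means the generator's reward can be read straight off the discriminator's pre-sigmoid output, which is also numerically safer than computing the two logarithms separately.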
But I got quite a weird behavior for the reward function.
[Attached: plots of the learned reward function at iterations 25, 75, 175, 275, 475, and 775.]
You see the trend here. What could be the possible reason that this is happening?
Thanks,
Ran
Hello,
Line 217 of /2_airl.py raises an error: v_preds_next is never assigned. Was something left out earlier?
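For reference, in many TensorFlow PPO implementations of this style, v_preds_next is simply the episode's value predictions shifted one step, with 0 appended for the terminal state. This is a hedged guess at the missing assignment, not the author's confirmed fix:

```python
# Hypothetical reconstruction (not the repo's confirmed code): v_preds_next
# holds V(s') for each transition; the last next-state value is 0 because
# the episode terminates there.
v_preds = [0.5, 0.7, 0.9]           # V(s_t) along one episode
v_preds_next = v_preds[1:] + [0.0]  # V(s_{t+1}), terminal value = 0
print(v_preds_next)  # [0.7, 0.9, 0.0]
```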
xxx:
Hello, I am very sorry it has taken me so long to reply, and thank you for your interest. Your confusion is caused by my not writing this up clearly.
The logic is as follows:
Two policies are needed because in the PPO algorithm we want to turn on-policy training into off-policy training, which speeds up the training process. The two policy networks differ only in their parameters (weights and biases): Old_Policy's parameters lag behind Policy's for a while.
Old_Policy should be the one interacting with the environment (I had written it the wrong way round before; it is now fixed), and the data it generates is used to train Policy. After Policy has been trained several times (the if episode % 4 == 0 check in 2_airl.py), PPO.assign_policy_parameters() is called on this line:
Line 224 in ec707ef
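The update scheme described above can be sketched as a toy loop. This is an illustrative stand-in, not the repo's TensorFlow code: Old_Policy collects data and lags behind Policy, and its parameters are overwritten every 4 episodes:

```python
import numpy as np

class TinyPolicy:
    """Stand-in for a policy network: just a parameter vector."""
    def __init__(self, dim):
        self.params = np.zeros(dim)

def assign_policy_parameters(policy, old_policy):
    # Copy the up-to-date Policy weights into Old_Policy.
    old_policy.params = policy.params.copy()

policy, old_policy = TinyPolicy(3), TinyPolicy(3)
for episode in range(1, 9):
    # Old_Policy interacts with the environment; its data trains Policy.
    policy.params += 0.1              # pretend gradient update to Policy
    if episode % 4 == 0:              # sync every 4 episodes, as in 2_airl.py
        assign_policy_parameters(policy, old_policy)

print(old_policy.params)  # synced copy of Policy's weights at episode 8
```

Between syncs, Old_Policy keeps the stale weights, which is exactly what makes the PPO importance-ratio correction meaningful.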
For the PPO algorithm itself, I mainly followed the lecture videos of Prof. Hung-yi Lee (NTU).
Thank you very much for your question; it has helped me a lot.
A-Liang
2021-09-22
------------------ Original message ------------------
From: "" <******>;
Sent: Monday, 30 August 2021, 17:54
To: "Jack Huang" [email protected];
Subject: Re: Question about the GitHub AIRL_MountainCar repo
Hello, is there any progress on this? The first question I raised earlier was my own mistake, sorry! But I still don't quite understand the second question, about the old_policy update.
xxx
--- Original message ---
From: "Jack Huang" [email protected]
Sent: Monday, 23 August 2021, 18:56
To: "xxx";
Subject: Re: Question about the GitHub AIRL_MountainCar repo
Thank you for your email.
I will take a look at the project soon and confirm the issue; it may take a day or two. The code was written quite a while ago and may need updating. I will get back to you then.
------------------ Original message ------------------
From: "xxx" [email protected];
Sent: Monday, 23 August 2021, 13:13
To: "Jack Huang" [email protected];
Subject: Question about the GitHub AIRL_MountainCar repo
Jack Huang,
Hello!
I came across your AIRL_MountainCar code on GitHub and would like to ask whether the repo is complete, particularly the generator part; some details of the policy update are unclear to me, for example:
If I have misunderstood anything, please point it out; I would very much appreciate your answer.
23 August 2021