
acer's People

Contributors

ethancaballero, feryal, jsbyysheng, kaixhin, nat-d, random-user-x


acer's Issues

Trust Region Updates

Hello @Kaixhin,

ACER/train.py

Line 97 in f22b07c

trust_loss += (param * z_star_p).sum()

I think we should freeze the value of z_star_p by calling z_star_p.detach(). Quoting the paper:

In the second stage, we take advantage of back-propagation. Specifically, the updated gradient with
respect to φθ, that is z_∗, is back-propagated through the network to compute the derivatives with
respect to the parameters. 

Please let me know what you think.
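A minimal sketch (not the ACER code itself) of what detaching changes: with .detach(), gradients stop flowing through z_star, so only `param` receives a gradient from the trust-region term. The tensor names here are illustrative stand-ins.

```python
import torch

# Hypothetical stand-ins for the parameter and the trust-region
# correction z_star from the quoted line of train.py.
param = torch.ones(3, requires_grad=True)
z_star = torch.full((3,), 2.0, requires_grad=True)

# Freezing z_star with .detach() treats it as a constant.
trust_loss = (param * z_star.detach()).sum()
trust_loss.backward()

print(param.grad)   # gradient equals the (frozen) z_star values
print(z_star.grad)  # None: no gradient flows into z_star
```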

_trust_region_loss variations

In its current form (without a minus sign in front of _trust_region_loss), the reward obtained just sits at ~9 on CartPole; it might take off after more steps, but I haven't tried that.
With a minus sign in front, the reward obtained starts improving immediately.

Doubts on Episodic Memory

ACER/main.py

Line 25 in 5b7ca5d

parser.add_argument('--memory-capacity', type=int, default=1000000, metavar='CAPACITY', help='Experience replay memory capacity')

Hello Kaixhin,

I do not get the idea behind using a memory capacity greater than T_max. Shouldn't the implementation have a memory capacity less than T_max? That looks more intuitive to me. Please let me know what you think.

Mcelog

mcelog reports a fatal kernel error on an Acer Aspire 3 A314.

Doubt about gradient transfer to shared model

First of all, thanks for publishing this work! It's been really helpful to me for learning many RL concepts in practice!

In train.py the gradient transfer from the local model to the shared model is performed by:

# Transfers gradients from thread-specific model to shared model
def _transfer_grads_to_shared_model(model, shared_model):
  for param, shared_param in zip(model.parameters(), shared_model.parameters()):
    if shared_param.grad is not None:
      return
    shared_param._grad = param.grad

I have a few questions as this function looks sub-optimal to me:

  • Does the function return when shared_param.grad is not None in order to prevent multiple workers from copying gradients at the same time?
  • If that is the case, it looks like those gradients are thrown away; am I correct?

If my understanding is not completely wrong, I was wondering whether there is a reason for not using a lock when transferring the gradients. Could the function be as follows?

# Transfers gradients from thread-specific model to shared model
def _transfer_grads_to_shared_model(model, shared_model, lock):
  with lock:
    for param, shared_param in zip(model.parameters(), shared_model.parameters()):
      shared_param._grad = param.grad

The same may also apply to the optimizer.step() call.
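The race the lock would prevent can be sketched without any PyTorch at all. Below, a hypothetical DummyParam class stands in for model parameters; the lock serialises the copy so concurrent workers cannot interleave partial gradient transfers:

```python
import threading

# Hypothetical stand-in for a model parameter (not a torch tensor).
class DummyParam:
  def __init__(self):
    self.grad = None

# Lock-protected version of the gradient transfer: only one worker
# copies its gradients into the shared parameters at a time.
def transfer_grads(model_params, shared_params, lock):
  with lock:
    for param, shared_param in zip(model_params, shared_params):
      shared_param.grad = param.grad

lock = threading.Lock()
local = [DummyParam() for _ in range(2)]
shared = [DummyParam() for _ in range(2)]
local[0].grad, local[1].grad = 0.1, 0.2

# Several workers transferring concurrently still leave shared
# gradients in a consistent state.
threads = [threading.Thread(target=transfer_grads, args=(local, shared, lock))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print([p.grad for p in shared])  # [0.1, 0.2]
```

In the actual repo the workers are processes, so a torch.multiprocessing lock would be needed rather than a threading one, but the shape of the fix is the same.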

KL Divergence

ACER/train.py

Line 71 in 5b7ca5d

kl = F.kl_div(distribution.log(), ref_distribution, size_average=False)

Shouldn't the code be F.kl_div(distribution, ref_distribution, size_average=False)? Why is there a log of the distribution?
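For reference, F.kl_div expects *log*-probabilities as its first argument, which is why .log() appears in the quoted line. A small check (using reduction='sum', the modern spelling of size_average=False):

```python
import torch
import torch.nn.functional as F

# Illustrative distributions, not the actual policy outputs.
p = torch.tensor([0.1, 0.2, 0.7])  # current policy
q = torch.tensor([0.3, 0.3, 0.4])  # reference (average) policy

# F.kl_div(input, target) computes sum(target * (log(target) - input)),
# so passing log-probabilities as input yields KL(q || p).
kl_via_f = F.kl_div(p.log(), q, reduction='sum')
kl_manual = (q * (q / p).log()).sum()
print(kl_via_f, kl_manual)  # the two values match
```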

Doubts

ACER/train.py

Line 81 in f22b07c

(-kl).backward(retain_graph=True)

Could you please let me know why there is a negative sign here? Since we have already defined the KL divergence in the previous step, I think we do not need a negative sign. Please let me know what you think.

batch_size for off-policy learning

Hey,
in the paper only one trajectory is sampled for each off-policy learning step, while here you use 16. For low-dimensional inputs this won't be much slower, but for higher-dimensional inputs it might be an issue. What do you think?

feed the previous action to lstm

Hey,
nice work! One quick question: you mentioned that "The agent also receives the previous action and reward", but this part is not from the ACER algorithm itself, only from the navigation paper, right?

Doubts on memory

ACER/memory.py

Line 24 in 3676af0

def sample(self, maxlen=0):

  def sample(self, maxlen=0):
    L = len(self.memory)
    if L > 0:
      e = random.randrange(L)
      mem = self.memory[e]
      T = len(mem)
      # Take a random subset of trajectory if maxlen specified, otherwise return full trajectory
      if maxlen > 0 and T > maxlen + 1:
        t = random.randrange(T - maxlen - 1)  # Include next state after final "maxlen" state
        return mem[t:t + maxlen + 1]
      else:
        return mem
    else:
      return None

Could the code be written this way? I do not get the reason behind using while (True).
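The proposed rewrite can be checked standalone. Below, a plain list of trajectories stands in for self.memory (a hypothetical stand-in, not the actual Memory class), showing the two branches: None for an empty memory, and a slice of maxlen + 1 states when maxlen is given:

```python
import random

# Standalone version of the proposed rewrite of Memory.sample.
def sample(memory, maxlen=0):
  if not memory:
    return None
  mem = random.choice(memory)  # pick a random trajectory
  T = len(mem)
  # Take a random subset if maxlen specified, otherwise the full trajectory
  if maxlen > 0 and T > maxlen + 1:
    t = random.randrange(T - maxlen - 1)  # include next state after final "maxlen" state
    return mem[t:t + maxlen + 1]
  return mem

trajectory = list(range(10))
print(sample([], maxlen=3))                 # None for empty memory
print(len(sample([trajectory], maxlen=3)))  # 4: maxlen + 1 states
```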

Configurations for Atari games

Hello, @Kaixhin. I am very glad to have found your PyTorch implementation of ACER, and I want to build something on top of it. However, I want to test on Atari games (pixel inputs) instead of CartPole (a control task). There are a lot of hyperparameters; I wonder whether I need to tune them or can keep them as you provide them. Looking forward to your reply.
