
reinforcement-learning's Introduction


Minimal and clean examples of reinforcement learning algorithms presented by the RLCode team. [Korean]

Maintainers - Woongwon, Youngmoo, Hyeokreal, Uiryeong, Keon

From the basics to deep reinforcement learning, this repo provides easy-to-read code examples. One file for each algorithm. Please feel free to create a Pull Request, or open an issue!

Dependencies

  1. Python 3.5
  2. Tensorflow 1.0.0
  3. Keras
  4. numpy
  5. pandas
  6. matplotlib
  7. pillow
  8. scikit-image
  9. h5py

Install Requirements

pip install -r requirements.txt

Table of Contents

Grid World - Mastering the basics of reinforcement learning in the simplified world called "Grid World"

CartPole - Applying deep reinforcement learning to the basic CartPole game.

Atari - Mastering Atari games with Deep Reinforcement Learning

OpenAI GYM - [WIP]

  • Mountain Car - DQN

reinforcement-learning's People

Contributors

20chase, bluemelon715, dnddnjs, fredcallaway, hyeokreal, jcwleo, keon, nextco, wooridle, zzing0907


reinforcement-learning's Issues

module 'pandas' has no attribute 'computation'

Running python Gridworld_DQN.py gives this error:

  File "/home/wangdawei/anaconda2/envs/py3/lib/python3.6/site-packages/dask/dataframe/core.py", line 38, in <module>
    pd.computation.expressions.set_use_numexpr(False)
AttributeError: module 'pandas' has no attribute 'computation'

At first I thought it was a pandas problem, but it turned out to be a dask version problem.

Updating dask to a newer version (for example, pip install --upgrade dask) solves it.

Is the A2C cartpole wrong?

I have compared the implementation with the book "RL: An Introduction". It seems the MSE loss and the cross-entropy loss do not recover the actor-critic update rule, which is w <- w + alpha * I * delta * grad_w v_hat(s, w) for the value function and theta <- theta + alpha * I * delta * grad_theta ln pi(a|s, theta) for the policy. In particular, for the value function, the MSE loss seems to pick up an extra factor of v_hat in the gradient.
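
For reference, a short editorial check (assuming the critic target is the return G_t, as in a REINFORCE-style actor-critic) that gradient descent on the MSE loss reproduces the book's value update apart from the I_t factor:

    \nabla_w \tfrac{1}{2}\bigl(G_t - \hat v(S_t, w)\bigr)^2
        = -\bigl(G_t - \hat v(S_t, w)\bigr)\,\nabla_w \hat v(S_t, w)
        = -\delta_t\,\nabla_w \hat v(S_t, w)

so a gradient-descent step w <- w - alpha * grad_w(MSE) is exactly w <- w + alpha * delta_t * grad_w v_hat(S_t, w).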

sampling from deque

Great stuff! This has been extremely helpful! My only suggestion would be in line 78, changing mini_batch = random.sample(self.memory, batch_size) to mini_batch = random.sample(list(self.memory), batch_size), otherwise you get the following error, "TypeError: Population must be a sequence or set. For dicts, use list(d)."

issue regarding saved models

I was looking at the code for Breakout and saw various saved models, but the code only saves one model. How were the other models saved? I would like to know whether they were saved after making some changes to the code.

Saved model usage

Could you please provide examples of how to use the saved models (.h5 files) at test time for the Grid World and CartPole environments?

Thanks,
Akilesh
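
A minimal test-time sketch for CartPole (the layer sizes and the weight file path below are assumptions based on cartpole_dqn.py; adjust them to match the model that produced your .h5 file):

    import gym
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    # rebuild the same architecture that produced the .h5 file (sizes here are assumptions)
    state_size, action_size = 4, 2
    model = Sequential()
    model.add(Dense(24, input_dim=state_size, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(action_size, activation='linear'))
    model.load_weights("./save_model/cartpole_dqn.h5")

    env = gym.make('CartPole-v1')
    state = env.reset()
    done, score = False, 0
    while not done:
        env.render()
        q_values = model.predict(np.reshape(state, [1, state_size]))
        action = int(np.argmax(q_values[0]))      # greedy action, no exploration
        state, reward, done, _ = env.step(action)
        score += reward
    print("score:", score)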

update target_model before loading saved model in cartpole_dqn.py

# initialize target model with same weights as the model, in case we load a model
# shouldn't this be done after load_model?
self.update_target_model()

if self.load_model:
    self.model.load_weights("./save_model/cartpole_dqn.h5")

could be

if self.load_model:
    self.model.load_weights("./save_model/cartpole_dqn.h5")

self.update_target_model()

so that if we load a saved model, the target_model would have the saved weights rather than starting with the Keras-initialized weights.

I'm going to test this, but it seems like the loaded model would be using an inferior target_model for at least the first episode, and the model weights could get adjusted in the wrong direction in that first episode, slightly slowing down its learning.

Giving image as input in Gridworld

Hi,

Is it possible to give an image as input in the Grid World environment? Can you suggest ways in which this could be done? Are there ways of converting the tkinter Canvas into a numpy array which can then be fed into a ConvNet?

Thanks,
Akilesh
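
One possible approach (an editorial sketch, not code from this repo): export the Canvas to PostScript and convert it with PIL, which requires Ghostscript to be installed for PIL to open the EPS data. The resulting array can then be resized and fed to a ConvNet.

    import io
    import numpy as np
    from PIL import Image

    def canvas_to_array(canvas, size=(84, 84)):
        # dump the current Canvas contents as PostScript (PIL needs Ghostscript to open it)
        ps = canvas.postscript(colormode='color')
        img = Image.open(io.BytesIO(ps.encode('utf-8')))
        img = img.convert('L').resize(size)   # grayscale + downscale before feeding a ConvNet
        return np.asarray(img, dtype=np.float32) / 255.0

    # usage (name is an assumption): arr = canvas_to_array(env.canvas), where env.canvas is
    # the Tk Canvas created inside environment.py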

Number of actions in deep SARSA grid world

Why does the implementation (deep_sarsa_agent.py) have action_space = [0, 1, 2, 3, 4] when there are only 4 possible actions that the agent can take (as specified in the environment.py)?

Thanks,
Akilesh

Failing to converge with increase in grid-size (Grid World)

If I increase both the HEIGHT and WIDTH from 5 to 10 keeping the obstacles and the final goal at the same position, Deep SARSA network doesn't seem to converge. What do you think is the problem? Should I increase the depth or dimensions of the hidden layer in actor and critic networks?

Thanks,
Akilesh

How to run threading while using Keras and tensorflow

Hi, I would like to test some hyperparameters using threading, which would be much faster. But when I run threading with the DQN and DDQN algorithms, the error says:
<Tensor Tensor("dense_1/kernel:0", shape=(2, 32), dtype=float32_ref) is not an element of this graph>
It seems Keras can't support threading, but your A3C works, which is strange to me.
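
A workaround commonly reported for this error with Keras 2.x on the TensorFlow 1.x backend (not verified against these scripts) is to build the model and its predict function in the main thread, keep a handle to the default graph, and wrap every call from a worker thread in that graph:

    import threading
    import numpy as np
    import tensorflow as tf
    from keras.models import Sequential
    from keras.layers import Dense

    # build the model in the main thread
    model = Sequential()
    model.add(Dense(32, input_dim=4, activation='relu'))
    model.add(Dense(2, activation='linear'))
    model._make_predict_function()       # private Keras API: build the predict function eagerly
    graph = tf.get_default_graph()       # remember which graph the model lives in

    def worker():
        # every call from a worker thread must run inside the model's graph
        with graph.as_default():
            print(model.predict(np.zeros((1, 4))))

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()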

Implementing policy gradient when number of output classes is large

Hello,

I am aware of this smart trick for implementing policy gradient (see this for a reference: https://github.com/rlcode/reinforcement-learning/blob/master/2-cartpole/3-reinforce/cartpole_reinforce.py). Specifically, categorical cross entropy is defined as H(p, q) = -sum(p_i * log(q_i)). For the action taken, a, we can set p_a = advantage * [index of action a in the one-hot-vector representation]. Meanwhile, q_a is the output of the policy network, which is the probability of taking action a, i.e. policy(s, a).

However, when the number of output classes is huge (e.g. as in machine translation or language modeling), I simply cannot convert the output into a one-hot vector in the first place using the to_categorical(output, num_classes=output_class) function in Keras.

Because of this, I cannot apply the trick to compute p_a.

So how to implement policy gradient in this case?

I hope I have made my question clear!

Many thanks for your help!

Best,

Cuong

@fredcallaway: I saw you commented on the code so I tagged you here as well. If you can give me an answer, I would really appreciate it ...
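
One way to avoid materializing one-hot vectors (an editorial sketch for TF 1.x / Keras 2.x, not code from this repo; the placeholder names actions and advantages and all sizes are made up) is to index the softmax output directly with the integer action ids via tf.gather_nd and build the policy-gradient loss from the gathered probabilities:

    import numpy as np
    import tensorflow as tf
    from keras import backend as K
    from keras.models import Sequential
    from keras.layers import Dense

    state_size, action_size = 10, 50000          # large discrete action space (made-up sizes)
    model = Sequential()
    model.add(Dense(128, input_dim=state_size, activation='relu'))
    model.add(Dense(action_size, activation='softmax'))

    actions = K.placeholder(shape=(None,), dtype='int32')   # integer action ids, no one-hot needed
    advantages = K.placeholder(shape=(None,))

    # gather pi(a|s) only for the actions that were actually taken
    idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
    picked_prob = tf.gather_nd(model.output, idx)
    loss = -K.mean(K.log(picked_prob + 1e-10) * advantages)

    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss, var_list=model.trainable_weights)
    K.get_session().run(tf.global_variables_initializer())  # also (re)initializes the untrained model
    train = K.function([model.input, actions, advantages], [loss], updates=[train_op])

    # usage with dummy data
    states = np.random.rand(32, state_size).astype('float32')
    acts = np.random.randint(action_size, size=32).astype('int32')
    advs = np.random.randn(32).astype('float32')
    print(train([states, acts, advs]))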

Env BreakoutDeterministic-v4 not found

When I try to run Breakout_DQN I get the following error:
gym.error.DeprecatedEnv: Env BreakoutDeterministic-v4 not found (valid versions include ['BreakoutDeterministic-v3', 'BreakoutDeterministic-v0'])

What version of gym are you using? (I'm using 0.8.1)

Catastrophic collapse in episode score on cartpole_a3c

Hi,
First of all I just want to say awesome work on the library overall, really love the concept 👍

I have an issue where cartpole_a3c converges relatively quickly (around episode 300-400), keeps doing well for a while, and then suddenly collapses and never recovers. Has anyone else experienced this?

Moving obstacles in Grid World

I don't understand the effect of moving the obstacles in Grid World (Deep SARSA and REINFORCE), since in environment.py the negative rewards are hard-coded for obstacles at coordinates [0, 1], [1, 2], and [2, 3].

Thanks,
Akilesh

Use of memory in Cartpole A3C

Hi,

Could anyone please elaborate on the use of memory in Cartpole A3C implementation? The saved samples haven't been used during training.

Thanks,
Akilesh

Use trained agent

Hello, the trained agent plays CartPole-v1 with a score of 500, but when I restart it with
self.load_model = True and the correct file name, it starts learning again with low scores.

How can I load the weights and have the trained agent just play, without learning?
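
A sketch of an evaluation-only loop (the names DQNAgent, agent.epsilon, agent.get_action, and the weight path follow the structure of cartpole_dqn.py, but are assumptions that should be checked against your copy):

    import gym
    import numpy as np
    # assumes the DQNAgent class from cartpole_dqn.py is available in the same file/session

    env = gym.make('CartPole-v1')
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n

    agent = DQNAgent(state_size, action_size)
    agent.model.load_weights("./save_model/cartpole_dqn.h5")
    agent.epsilon = 0.0                      # greedy play only, no exploration

    for episode in range(10):
        state = np.reshape(env.reset(), [1, state_size])
        done, score = False, 0
        while not done:
            env.render()
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            state = np.reshape(next_state, [1, state_size])
            score += reward
            # note: no append_sample / train_model / update_target_model calls here
        print("episode:", episode, "score:", score)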

Reference links for each algorithm

Hi,

It's a really great repo for learning RL. However, if you could provide some links/blogs explaining each algorithm, that would be even more beneficial for users.

Thanks,
Akilesh

A3C on GPU

Hi.
Really nice job. This is the most readable and "easiest" code I have found for an A3C implementation. With regular TensorFlow on CPU the code works fine, but with tensorflow-gpu I get the error below.

Do you know why this is happening, and is it possible to get the A3C code working with GPU acceleration?
Thanks in advance!

Caused by op 'IsVariableInitialized_16/IsVariableInitialized_22/IsVariableInitialized/IsVariableInitialized_13/IsVariableInitialized_6/IsVariableInitialized_7', defined at:
  File "C:\Users\trek\.vscode\extensions\ms-python.python-2018.3.1\pythonFiles\PythonTools\visualstudio_py_debugger.py", line 2068, in new_thread_wrapper
    func(*posargs, **kwargs)
  File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\threading.py", line 882, in _bootstrap
    self._bootstrap_inner()
  File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\threading.py", line 914, in _bootstrap_inner
    self.run()
  File "d:\Thesis\Code\examples\cartpole\a3c2.py", line 159, in run
    action = self.get_action(state)
  File "d:\Thesis\Code\examples\cartpole\a3c2.py", line 209, in get_action
    policy = self.actor.predict(np.reshape(state, [1, self.state_size]))[0]
  File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1835, in predict
    verbose=verbose, steps=steps)
  File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1330, in _predict_loop
    batch_outs = f(ins_batch)
  File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\keras\backend\tensorflow_backend.py", line 2476, in __call__
    session = get_session()
  File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\keras\backend\tensorflow_backend.py", line 192, in get_session
    [tf.is_variable_initialized(v) for v in candidate_vars])
  File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\keras\backend\tensorflow_backend.py", line 192, in <listcomp>
    [tf.is_variable_initialized(v) for v in candidate_vars])
  File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\variables.py", line 1203, in is_variable_initialized
    return state_ops.is_variable_initialized(variable)
  File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\state_ops.py", line 180, in is_variable_initialized
    return gen_state_ops.is_variable_initialized(ref=ref, name=name)
  File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 175, in is_variable_initialized
    result = _op_def_lib.apply_op("IsVariableInitialized", ref=ref, name=name)
  File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "C:\Users\trek\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Cannot colocate nodes 'IsVariableInitialized_16/IsVariableInitialized_22/IsVariableInitialized/IsVariableInitialized_13/IsVariableInitialized_6/IsVariableInitialized_7' and 'Adam_1/iterations: Cannot merge devices with incompatible types: '/job:localhost/replica:0/task:0/device:GPU:0' and '/job:localhost/replica:0/task:0/device:CPU:0'
	 [[Node: IsVariableInitialized_16/IsVariableInitialized_22/IsVariableInitialized/IsVariableInitialized_13/IsVariableInitialized_6/IsVariableInitialized_7 = IsVariableInitialized[_class=["loc:@Adam/beta_1", "loc:@Adam/beta_2", "loc:@Adam_1/iterations", "loc:@Variable_27", "loc:@Variable_4", "loc:@dense_3/bias"], dtype=DT_FLOAT](Variable_4)]]
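
A workaround often suggested for this kind of colocation error (an assumption on my part, not something verified against this repo) is to enable soft device placement and hand the session to Keras before the actor/critic networks are built, so ops that cannot live on the GPU fall back to the CPU:

    import tensorflow as tf
    from keras import backend as K

    # let TensorFlow fall back to the CPU for ops that cannot be placed on the GPU
    config = tf.ConfigProto(allow_soft_placement=True)
    config.gpu_options.allow_growth = True       # optional: avoid grabbing all GPU memory up front
    K.set_session(tf.Session(config=config))
    # ... build the actor/critic networks after this point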

Add comment on the use of categorical cross entropy in REINFORCE and a2c

I was surprised to see this loss function because it is generally used when the target is a distribution (i.e. sums to 1). This is not the case for the advantage estimate. However, I worked out the math and it does appear to be doing the right thing, which is neat!

I think this trick should be mentioned in the code.
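
For reference, a short sketch of why the trick works (the pseudo-target p is zero everywhere except p_a = A at the taken action a, and q = pi(.|s) is the softmax output):

    H(p, q) = -\sum_i p_i \log q_i = -A \log \pi(a \mid s)

so minimizing this "cross-entropy" by gradient descent is the same as performing gradient ascent on the single-sample policy-gradient objective A * log pi(a|s).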

A3C for Gridworld

I am trying to implement A3C for gridworld by appropriately modifying run() method of A3C cartpole example. However, I am getting the below error:

Exception ignored in: <bound method PhotoImage.__del__ of <PIL.ImageTk.PhotoImage object at 0x7f7b807077b8>>
Traceback (most recent call last):
  File "/home/akb/.local/lib/python3.4/site-packages/PIL/ImageTk.py", line 130, in __del__
    name = self.__photo.name
AttributeError: 'PhotoImage' object has no attribute '_PhotoImage__photo'
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "deep_a3c.py", line 159, in run
    env = Env()
  File "/home/akb/reinforcement-learning/1-grid-world/6-deep-sarsa/environment_a3c.py", line 21, in __init__
    self.shapes = self.load_images()
  File "/home/akb/reinforcement-learning/1-grid-world/6-deep-sarsa/environment_a3c.py", line 58, in load_images
    Image.open("../img/rectangle.png").resize((30, 30)))
  File "/home/akb/.local/lib/python3.4/site-packages/PIL/ImageTk.py", line 124, in __init__
    self.__photo = tkinter.PhotoImage(**kw)
  File "/usr/lib/python3.4/tkinter/__init__.py", line 3419, in __init__
    Image.__init__(self, 'photo', name, cnf, master, **kw)
  File "/usr/lib/python3.4/tkinter/__init__.py", line 3375, in __init__
    self.tk.call(('image', 'create', imgtype, name,) + options)
RuntimeError: main thread is not in main loop

Any help on this error would be appreciated. If possible, could you provide an implementation of A3C on Grid World?

Thanks,
Akilesh

Query regarding 'advantages' in a2c

The actor net takes state as input and outputs a policy containing the probability of each action. In train_model(), the ground truth for training actor net is 'advantages' which is not a probability distribution over possible actions. So, how does the categorical cross-entropy computation between the predicted output of actor net and 'advantages' work?

Thanks,
Akilesh

Issue with breakout_a3c.py in 3-atari when I execute the source

PC1
cpu: Intel i5
no graphic card
python 3.5
tensorflow 1.14
keras 2.3.0

PC2
cpu: Intel i7
rtx-2070
python 3.5
tensorflow 1.14
keras 2.3.0

When I execute breakout_a3c.py, the following problem occurs on both computers.
I guess the issue is related to the threading library...

Model: "model_18"

Layer (type) Output Shape Param #
input_9 (InputLayer) (None, 84, 84, 4) 0

conv2d_17 (Conv2D) (None, 20, 20, 16) 4112

conv2d_18 (Conv2D) (None, 9, 9, 32) 8224

flatten_9 (Flatten) (None, 2592) 0

dense_25 (Dense) (None, 256) 663808

dense_27 (Dense) (None, 1) 257
Total params: 676,401
Trainable params: 676,401
Non-trainable params: 0

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/name/AdvRL/reinforcement-learning/3-atari/1-breakout/breakout_a3c.py", line 207, in run
    action, policy = self.get_action(history)
  File "/home/name/AdvRL/reinforcement-learning/3-atari/1-breakout/breakout_a3c.py", line 327, in get_action
    policy = self.local_actor.predict(history)[0]
  File "/home/name/.local/lib/python3.5/site-packages/keras/engine/training.py", line 1462, in predict
    callbacks=callbacks)
  File "/home/name/.local/lib/python3.5/site-packages/keras/engine/training_arrays.py", line 276, in predict_loop
    callbacks.model.stop_training = False
  File "/home/name/.local/lib/python3.5/site-packages/keras/engine/network.py", line 323, in __setattr__
    super(Network, self).__setattr__(name, value)
  File "/home/name/.local/lib/python3.5/site-packages/keras/engine/base_layer.py", line 1215, in __setattr__
    if not _DISABLE_TRACKING.value:
AttributeError: '_thread._local' object has no attribute 'value'

(The same traceback is repeated for Thread-3 through Thread-9.)

Why use self.batch_size instead of batch_size

From train_model in reinforcement-learning/2-cartpole/1-dqn/cartpole_dqn.py:

def train_model(self):
    if len(self.memory) < self.train_start:
        return
    batch_size = min(self.batch_size, len(self.memory))
    mini_batch = random.sample(self.memory, batch_size)

    update_input = np.zeros((batch_size, self.state_size))
    update_target = np.zeros((batch_size, self.state_size))
    action, reward, done = [], [], []

    for i in range(self.batch_size):
        update_input[i] = mini_batch[i][0]
        action.append(mini_batch[i][1])
        reward.append(mini_batch[i][2])
        update_target[i] = mini_batch[i][3]
        done.append(mini_batch[i][4])

    target = self.model.predict(update_input)
    target_val = self.target_model.predict(update_target)

    for i in range(self.batch_size):
        # Q Learning: get maximum Q value at s' from target model
        if done[i]:
            target[i][action[i]] = reward[i]
        else:
            target[i][action[i]] = reward[i] + self.discount_factor * (
                np.amax(target_val[i]))

    # and do the model fit!
    self.model.fit(update_input, target, batch_size=self.batch_size,
                   epochs=1, verbose=0)

In this part of the code, why do you use self.batch_size after taking the minimum of self.batch_size and the length of the memory? Would batch_size be better?

reinforcement learning real life use cases

Hi Shangtong, I am new to reinforcement learning. I have a scenario in which I have a machine learning model that predicts a target properly. I want to figure out the input parameters needed to attain a particular target value. Any suggestions would be great.
The input parameters may range from 4 to 20, and since they are discrete numeric values there may be a lot of combinations.

Can this code run other Atari games besides Breakout?

I want to run other Atari games, but the performance doesn't look good. Could anyone help me? Can I achieve this just by changing "gym.make('ENV_NAME')" and its real_action? I'd really appreciate any help.
I have changed the code as described above; the performance is not so good, and some errors occur, like this:

Traceback (most recent call last):
  File "ddqn_spaceinvaders.py", line 372, in <module>
    agent.train_replay(step)
  File "ddqn_spaceinvaders.py", line 235, in train_replay
    history[i] = np.float32(mini_batch[i][0] / 255.)
TypeError: unsupported operand type(s) for /: 'tuple' and 'float'
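
A sketch of the usual first steps when switching games (this is generic gym usage, not verified against ddqn_spaceinvaders.py; the env id depends on your gym version):

    import gym
    import numpy as np

    env = gym.make('SpaceInvadersDeterministic-v4')    # any Atari env id valid for your gym version
    action_size = env.action_space.n                   # use the env's own action count, not Breakout's hard-coded 3
    print(env.unwrapped.get_action_meanings())         # check what each action index means

    observe = env.reset()
    # the replay memory should hold plain numpy arrays; storing a tuple instead is one way to end up
    # with "unsupported operand type(s) for /: 'tuple' and 'float'" when the history is divided by 255.
    assert isinstance(observe, np.ndarray)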

Batch size in A2C Cartpole

Are there no variants of A2C with mini-batch update instead of training every time step? If yes, could you tell the pros and cons of such an approach?

Thanks,
Akilesh

couple a3c questions / recommendations for generalizing beyond Atari

First, thanks for making this. It's very easy to get started with and has really helped me move things forward on a personal project of mine I've been struggling with for months. This is really awesome work. Thanks again.

In my efforts to tweak the code from your A3C cartpole implementation to work with my own custom OpenAI environment, I've discovered a few things that I think can help make it generalize a bit more.

  1. the paper says all the layers except the output are generally shared between the actor and critic. I'm curious - why do both your actor and critic networks have a private hidden layer before the output? Mine has 4 shared relu layers each with 1600 neurons, then the actor gets a softmax and the critic gets a linear layer for the output, and this has really helped my stability issues, though my environment is quite a bit more complicated than Atari, so I'm not sure if there's some advantage to each having a private layer. I could totally be missing something here though.
  2. My environment has a massive discrete action space (1547 actions), which is what led me here. So, one change I made was to add an action filter to your "get_action" function. This is a vector of ones and zeros output by my environment that can be used to zero out the probabilities for invalid actions. I needed to renormalize the output from the actor so the probabilities all sum to 1 again after doing this, but it's really helped speed things up. Probably unnecessary unless you're dealing with large action spaces, but crucial if you are. You can always penalize invalid actions instead, but in a large action space I've found that adds a ton of time to training. Anyway, here's what I did:

         def get_action(self, state, actionfilter):
             policy = self.actor.predict(np.reshape(state, [1, self.state_size]))[0]
             policy = np.multiply(policy, actionfilter)   # zero out invalid actions
             probs = policy / np.sum(policy)              # renormalize so the probabilities sum to 1
             action = np.random.choice(self.action_size, 1, p=probs)[0]
             return action

    where actionfilter is provided through a custom function with the environment. It would be easy enough to only use it if it was passed, or just default it to a vector of ones with the same size as the action space.
  3. I'm in the process now of giving each actor a different epsilon that's used to determine how much it explores, which will also be fed into the 'get_action' function. The original paper claims that if each agent has a different exploration policy, it can really help with stability, so I'm hoping this will help a bit more. To date, I've had to gradually lower my learning rate to find one that works (for me, I had to go all the way down to 1e-10).
    Anything greater than that can cause an actor to return NaNs and then the whole thing falls apart, and anything lower just inches along at a glacial pace. I wish I could take it up a bit, but the NaNs are killing me. I tried gradient clipping, but it's really hard to find a good threshold to use. Anyway, implementing different exploration policies should be pretty easy to do... might be worth checking out. I suppose it would also be possible to randomly pick a more abstract exploration type during initialization: have one that's purely greedy, another epsilon-greedy with some random epsilon, and maybe a couple of other policies thrown in for kicks. I'm going to test this out this week to see if it has any effect. I can report back if you're interested.

Saving QLearning Agent

First of all, great tutorials! I've been basing my own projects on this repo to better understand RL, but through the process I found that persisting the Q-learning agent turns out to be really difficult because of its final size.

I tried pickle, json, jsonpickle, cPickle, marshal, klepto, dbm and finally h5py, and I noticed it might not be as easy as it seems, because none of these worked. My 64-bit Linux Mint system kills the process and leaves a 0-byte file where the q_table should be.

The agent itself works (rewards keep improving), but once it is trained to a certain point it seems to become impossible to persist it back to disk. Suspecting it was running out of memory, I tried creating swap space, to no avail.

Would be glad if anyone has a fix for this. Thanks!

Tutorial

Dear

Do you have any tutorial for the code listed in this GitHub repo, or have you created a tutorial for your code?

Thx

A3C algorithm - background

Amazing work!!
Tried running the A3C algorithm for breakout and it works great!
Where did you get the background information in order to write the code? It's a little bit different than what was explained in the "Asynchronous Methods for Deep Reinforcement Learning" paper.
Thanks :)

My code is very poor at learning the 2048 game using Double DQN

Firstly, thanks for the great collection of code and articles. The articles were very useful for understanding DQN and implementing it.

However, my code is very bad at learning. I am not sure what is wrong with it. I am using DDQN and passing rewards based on different criteria. The state is just a normalized version of the board itself.

My code repo is here: https://github.com/codetiger/MachineLearning-2048
Let me know if you can review it and help me understand why my code does not learn anything even after 1000 episodes.

How to add Dropout

Hi all,

Thanks for your amazing project!

I have a question. If I want to add dropout to the network for policy gradient, how can I do that?
I think in order to do that, I would need to completely change the code. Right now the workflow is as follows:
get the state -> do a forward pass -> get the output -> compute the gradient -> create a new (input, output) pair to train the network -> train the network on that pair for one epoch -> repeat.

However, to add dropout we would need to change the workflow as follows:
get the state -> do a forward pass -> get the output -> compute the gradient -> backpropagate the gradient -> modify the network parameters -> repeat.

This seems really complicated with an automatic differentiation system like Keras, I think. Any ideas?

Thanks a lot for your help!

Best,
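
A minimal sketch of one way to do this without restructuring the training loop, assuming the Keras 2.x / TF 1.x setup used in this repo. The K.function pattern in the comments mirrors the custom optimizer in cartpole_reinforce.py, but the exact variable names (actions, discounted_rewards, updates) are assumptions:

    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    from keras import backend as K

    state_size, action_size = 4, 2        # CartPole-sized example (sizes are assumptions)
    model = Sequential()
    model.add(Dense(24, input_dim=state_size, activation='relu'))
    model.add(Dropout(0.5))               # active only in the training phase
    model.add(Dense(action_size, activation='softmax'))

    # If the training step is a custom K.function, pass the learning phase explicitly so dropout
    # is switched on during updates and off at prediction time:
    #
    #   train = K.function([model.input, actions, discounted_rewards, K.learning_phase()],
    #                      [], updates=updates)
    #   train([states, actions, rewards, 1])   # 1 = training phase (dropout on)
    #   model.predict(state)                   # predict runs in the test phase (dropout off)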

Convergence

How many days/episodes did it take until it converged in breakout_a3c? Did you try using LSTM for faster convergence?

Pong Policy Gradient - important error in the definition of the convolutional net

I tried to run Pong Policy Gradient for 2000 episodes on the original file with no results whatsoever. Then I boosted the reward for positive points (points scored by the learner, right side) to 20 and got this result:
[score plot: pong_reinforce_v1 02x20x-1]
I boosted the learner's point reward to 100 and after around 1500 episodes got a slight improvement, similar to that in the picture. I ran it to 8100 episodes and there was no improvement except for a slightly smaller variance. Forgive my being naive, but having successfully run three versions of cartpole I was expecting some reasonable results.
As you can see from the picture, the variance is large, and after an improvement around episode 800-900 the results seem stagnant.
Has anybody run it for more episodes, tweaked the rewards, and brought the results up and the variance down?
Given the policy, should I boost the penalty for the opponent's (left side's) scoring points?
Any guidance would be appreciated. Thanks.

Why are you using SARSA instead of Q-Learning?

You are doing Q-Learning:

            # get action for the current state and go one step in environment
            action = agent.get_action(state)
            next_state, reward, done, info = env.step(action)

target[i][action[i]] = reward[i] + self.discount_factor * (
    np.amax(target_val[i]))

But isn't that SARSA?

                a = np.argmax(target_next[i])
                target[i][action[i]] = reward[i] + self.discount_factor * (target_val[i][a])

Is that a mistake or is that a valid approach? I'm new to RL...
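
For reference, a generic side-by-side of the target rules being contrasted here (an editorial sketch in the same spirit as the snippets above, not code taken from the repo):

    import numpy as np

    def q_learning_target(reward, target_val_next, gamma):
        # off-policy: bootstrap with the value of the best next action
        return reward + gamma * np.amax(target_val_next)

    def sarsa_target(reward, target_val_next, next_action, gamma):
        # on-policy: bootstrap with the value of the action actually taken next
        return reward + gamma * target_val_next[next_action]

    def double_dqn_target(reward, online_val_next, target_val_next, gamma):
        # pick the argmax with the online network, evaluate it with the target network
        a = np.argmax(online_val_next)
        return reward + gamma * target_val_next[a]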

Expected future rewards

Hi,

In the CartPole DDQN, the following Q(s,a) formula uses target_val; is it a one-step reward or the expected future reward?

target[i][action[i]] = reward[i] + self.discount_factor * ( target_val[i][a])

play part

Is there any code to play Breakout by loading the saved model?
