I have been trying to get your A3C implementation to work for Atari games using the NIPS/Nature network.
Using the network initialization from the DQN example and a few modifications to the a3c.py file, I keep getting an error on line 282:
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "a3c_atari.py", line 139, in run
agent.fit(gym.make(ENV_NAME), nb_steps=1750000, visualize=False, verbose=verbose,callbacks = callbacks)
File "/home/pavitrakumar/Desktop/keras-rl-master-a3c-files/core.py", line 125, in fit
metrics = self.backward(reward, terminal=done)
File "/home/pavitrakumar/Desktop/keras-rl-master-a3c-files/a3c.py", line 309, in backward
means, variances = self.actor_train_fn(inputs)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/theano_backend.py", line 811, in __call__
return self.function(*inputs)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 871, in __call__
storage_map=getattr(self.fn, 'storage_map', None))
File "/usr/local/lib/python2.7/dist-packages/theano/gof/link.py", line 314, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 859, in __call__
outputs = self.fn()
ValueError: GpuElemwise. Input dimension mis-match. Input 2 (indices start at 0) has shape[3] == 32, but the output's size on that axis is 12288.
Apply node that caused the error: GpuElemwise{Composite{(i0 + (i1 * (i2 / i3)))}}[(0, 2)](GpuElemwise{Composite{(i0 * ((i1 * i2) + (i3 * i4)))}}[(0, 1)].0, CudaNdarrayConstant{[[[[ 9.99999975e-05]]]]}, Rebroadcast{?,?,0}.0, GpuFromHost.0)
Toposort index: 321
Inputs types: [CudaNdarrayType(float32, (True, True, True, False)), CudaNdarrayType(float32, (True, True, True, True)), CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, (True, True, True, True))]
Inputs shapes: [(1, 1, 1, 12288), (1, 1, 1, 1), (8, 8, 1, 32), (1, 1, 1, 1)]
Inputs strides: [(0, 0, 0, 1), (0, 0, 0, 0), (256, 32, 0, 1), (0, 0, 0, 0)]
Inputs values: ['not shown', CudaNdarray([[[[ 9.99999975e-05]]]]), 'not shown', CudaNdarray([[[[ 12.]]]])]
Outputs clients: [[GpuElemwise{Composite{((i0 * i1) + (i2 * sqr((-i3))))}}[(0, 1)](GpuDimShuffle{x,x,x,x}.0, <CudaNdarrayType(float32, 4D)>, GpuDimShuffle{x,x,x,x}.0, GpuElemwise{Composite{(i0 + (i1 * (i2 / i3)))}}[(0, 2)].0), GpuElemwise{Composite{((i0 * i1) + (i2 * i3 * i4))}}[(0, 1)](GpuDimShuffle{x,x,x,x}.0, <CudaNdarrayType(float32, 4D)>, CudaNdarrayConstant{[[[[-1.]]]]}, GpuDimShuffle{x,x,x,x}.0, GpuElemwise{Composite{(i0 + (i1 * (i2 / i3)))}}[(0, 2)].0), GpuElemwise{Mul}[(0, 2)](CudaNdarrayConstant{[[[[-1.]]]]}, GpuDimShuffle{x,x,x,x}.0, GpuElemwise{Composite{(i0 + (i1 * (i2 / i3)))}}[(0, 2)].0)]]
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
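The two hints above can be applied without touching the code. A minimal `.theanorc` fragment (or the equivalent `THEANO_FLAGS` environment variable) enabling both would look like this; it should print a back-trace of where the failing node was created:

```ini
[global]
optimizer = fast_compile
exception_verbosity = high
```

This is the network setup I am using: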
WINDOW_LENGTH = 1
INPUT_SHAPE = (84, 84)
input_shape = (WINDOW_LENGTH,) + INPUT_SHAPE
#actor_input = Input(shape=env.observation_space.shape)
if K.image_dim_ordering() == 'tf':
    # (width, height, channels)
    actor_input = Input(shape=(input_shape[1], input_shape[2], input_shape[0]))
    critic_input = Input(shape=(input_shape[1], input_shape[2], input_shape[0]))
elif K.image_dim_ordering() == 'th':
    # (channels, width, height)
    actor_input = Input(shape=(input_shape[0], input_shape[1], input_shape[2]))
    critic_input = Input(shape=(input_shape[0], input_shape[1], input_shape[2]))
else:
    raise RuntimeError('Unknown image_dim_ordering.')
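For reference, here is a minimal standalone sketch of how the `(WINDOW_LENGTH,) + INPUT_SHAPE` tuple above is reordered for each backend ordering (pure Python, no Keras needed):

```python
WINDOW_LENGTH = 1
INPUT_SHAPE = (84, 84)
input_shape = (WINDOW_LENGTH,) + INPUT_SHAPE  # (1, 84, 84)

# 'tf' ordering expects (rows, cols, channels); 'th' expects (channels, rows, cols)
tf_shape = (input_shape[1], input_shape[2], input_shape[0])
th_shape = (input_shape[0], input_shape[1], input_shape[2])
print(tf_shape)  # (84, 84, 1)
print(th_shape)  # (1, 84, 84)
```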
x = None
x = Convolution2D(32, 8, 8, subsample=(4, 4))(actor_input)
x = Activation('relu')(x)
x = Convolution2D(64, 4, 4, subsample=(2, 2))(x)
x = Activation('relu')(x)
x = Convolution2D(64, 3, 3, subsample=(1, 1))(x)
x = Activation('relu')(x)
x = Flatten()(x)
x = Dense(256)(x)
x = Activation('relu')(x)
actor_mean_output = Dense(nb_actions)(x)
actor_mean_output = Activation('linear')(actor_mean_output)
actor_variance_output = Dense(nb_actions)(x)
actor_variance_output = Activation('softplus')(actor_variance_output)
actor = Model(input=actor_input, output=[actor_mean_output, actor_variance_output])
print(actor.summary())
#critic_input = Input(shape=env.observation_space.shape)
x = None
x = Convolution2D(32, 8, 8, subsample=(4, 4))(critic_input)
x = Activation('relu')(x)
x = Convolution2D(64, 4, 4, subsample=(2, 2))(x)
x = Activation('relu')(x)
x = Convolution2D(64, 3, 3, subsample=(1, 1))(x)
x = Activation('relu')(x)
x = Flatten()(x)
x = Dense(512)(x)
x = Activation('relu')(x)
x = Dense(1)(x)
x = Activation('linear')(x)
critic = Model(input=critic_input, output=x)
print(critic.summary())
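As a sanity check on the convolution arithmetic (independent of Keras/Theano), the per-layer output sizes for an 84x84 input with 'valid' (border_mode='valid') convolutions can be computed by hand, and they agree with the pasted summaries:

```python
def conv_out(size, kernel, stride):
    # Output size along one axis for a 'valid' (no padding) convolution
    return (size - kernel) // stride + 1

size = 84
for kernel, stride, channels in [(8, 4, 32), (4, 2, 64), (3, 1, 64)]:
    size = conv_out(size, kernel, stride)
    print((size, size, channels))  # (20, 20, 32), then (9, 9, 64), then (7, 7, 64)
print(size * size * 64)  # 3136 units after Flatten
```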
I understand that the error comes from the actor network, which is compiled as a Theano function (I am using the Theano backend) on line 165 in a3c.py. I have checked all the input dimensions, but I still cannot find out why I am getting the dimension mismatch error.
#obsv, Rs, Vs, actions (Environment: Breakout-v0 (nb_actions: 6))
[(1, 84, 84, 1), (1,), (1,), (1, 6)]
This also seems consistent with the network input (correct me if I'm wrong).
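For completeness, a minimal numpy sketch of that batch (dummy zeros; only the shapes matter here) reproduces exactly the shape list printed above:

```python
import numpy as np

# Dummy single-step batch for Breakout-v0 (nb_actions = 6)
obsv = np.zeros((1, 84, 84, 1), dtype='float32')  # observation
Rs = np.zeros((1,), dtype='float32')              # discounted return
Vs = np.zeros((1,), dtype='float32')              # value estimate
actions = np.zeros((1, 6), dtype='float32')       # per-action vector
print([a.shape for a in (obsv, Rs, Vs, actions)])
# [(1, 84, 84, 1), (1,), (1,), (1, 6)]
```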
____________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
input_1 (InputLayer) (None, 84, 84, 1) 0
____________________________________________________________________________________________________
convolution2d_1 (Convolution2D) (None, 20, 20, 32) 2080 input_1[0][0]
____________________________________________________________________________________________________
activation_1 (Activation) (None, 20, 20, 32) 0 convolution2d_1[0][0]
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D) (None, 9, 9, 64) 32832 activation_1[0][0]
____________________________________________________________________________________________________
activation_2 (Activation) (None, 9, 9, 64) 0 convolution2d_2[0][0]
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D) (None, 7, 7, 64) 36928 activation_2[0][0]
____________________________________________________________________________________________________
activation_3 (Activation) (None, 7, 7, 64) 0 convolution2d_3[0][0]
____________________________________________________________________________________________________
flatten_1 (Flatten) (None, 3136) 0 activation_3[0][0]
____________________________________________________________________________________________________
dense_1 (Dense) (None, 256) 803072 flatten_1[0][0]
____________________________________________________________________________________________________
activation_4 (Activation) (None, 256) 0 dense_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense) (None, 6) 1542 activation_4[0][0]
____________________________________________________________________________________________________
dense_3 (Dense) (None, 6) 1542 activation_4[0][0]
____________________________________________________________________________________________________
activation_5 (Activation) (None, 6) 0 dense_2[0][0]
____________________________________________________________________________________________________
activation_6 (Activation) (None, 6) 0 dense_3[0][0]
====================================================================================================
Total params: 877996
____________________________________________________________________________________________________
None
____________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
input_2 (InputLayer) (None, 84, 84, 1) 0
____________________________________________________________________________________________________
convolution2d_4 (Convolution2D) (None, 20, 20, 32) 2080 input_2[0][0]
____________________________________________________________________________________________________
activation_7 (Activation) (None, 20, 20, 32) 0 convolution2d_4[0][0]
____________________________________________________________________________________________________
convolution2d_5 (Convolution2D) (None, 9, 9, 64) 32832 activation_7[0][0]
____________________________________________________________________________________________________
activation_8 (Activation) (None, 9, 9, 64) 0 convolution2d_5[0][0]
____________________________________________________________________________________________________
convolution2d_6 (Convolution2D) (None, 7, 7, 64) 36928 activation_8[0][0]
____________________________________________________________________________________________________
activation_9 (Activation) (None, 7, 7, 64) 0 convolution2d_6[0][0]
____________________________________________________________________________________________________
flatten_2 (Flatten) (None, 3136) 0 activation_9[0][0]
____________________________________________________________________________________________________
dense_4 (Dense) (None, 512) 1606144 flatten_2[0][0]
____________________________________________________________________________________________________
activation_10 (Activation) (None, 512) 0 dense_4[0][0]
____________________________________________________________________________________________________
dense_5 (Dense) (None, 1) 513 activation_10[0][0]
____________________________________________________________________________________________________
activation_11 (Activation) (None, 1) 0 dense_5[0][0]
====================================================================================================
Total params: 1678497
____________________________________________________________________________________________________
None