
tensorforce's Introduction

Tensorforce: a TensorFlow library for applied reinforcement learning


This project is no longer maintained!

Introduction

Tensorforce is an open-source deep reinforcement learning framework, with an emphasis on modularized flexible library design and straightforward usability for applications in research and practice. Tensorforce is built on top of Google's TensorFlow framework and requires Python 3.

Tensorforce follows a set of high-level design choices which differentiate it from other similar libraries:

  • Modular component-based design: Feature implementations, above all, strive to be as generally applicable and configurable as possible, potentially at some cost of faithfully resembling details of the introducing paper.
  • Separation of RL algorithm and application: Algorithms are agnostic to the type and structure of inputs (states/observations) and outputs (actions/decisions), as well as the interaction with the application environment.
  • Full-on TensorFlow models: The entire reinforcement learning logic, including control flow, is implemented in TensorFlow, to enable portable computation graphs independent of application programming language, and to facilitate the deployment of models.


Installation

A stable version of Tensorforce is periodically updated on PyPI and installed as follows:

pip3 install tensorforce

To always use the latest version of Tensorforce, install the GitHub version instead:

git clone https://github.com/tensorforce/tensorforce.git
pip3 install -e tensorforce

Note on installation on M1 Macs: At the moment TensorFlow, which is a core dependency of Tensorforce, cannot be installed directly on M1 Macs. Follow the "M1 Macs" section in the documentation for a workaround.

Environments may require additional packages, for which setup options are available (ale, gym, retro, vizdoom, carla; or envs for all environments); however, some also require additional tools to be installed separately (see the environments documentation). Other setup options include tfa for TensorFlow Addons and tune for HpBandSter, which is required for the tune.py script. An example of installing such extras is shown below.
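
For example, to install Tensorforce together with the extra dependencies for a particular environment or tool (a minimal illustration using the extra names listed above; check the installation documentation for the exact extras):

pip3 install tensorforce[gym]
pip3 install tensorforce[envs]        # all environment extras
pip3 install tensorforce[tfa,tune]    # TensorFlow Addons and HpBandSter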

Note on GPU usage: Different from (un)supervised deep learning, RL does not always benefit from running on a GPU, depending on environment and agent configuration. In particular for environments with low-dimensional state spaces (i.e., no images), it is hence worth trying to run on CPU only.

Quickstart example code

from tensorforce import Agent, Environment

# Pre-defined or custom environment
environment = Environment.create(
    environment='gym', level='CartPole', max_episode_timesteps=500
)

# Instantiate a Tensorforce agent
agent = Agent.create(
    agent='tensorforce',
    environment=environment,  # alternatively: states, actions, (max_episode_timesteps)
    memory=10000,
    update=dict(unit='timesteps', batch_size=64),
    optimizer=dict(type='adam', learning_rate=3e-4),
    policy=dict(network='auto'),
    objective='policy_gradient',
    reward_estimation=dict(horizon=20)
)

# Train for 300 episodes
for _ in range(300):

    # Initialize episode
    states = environment.reset()
    terminal = False

    while not terminal:
        # Episode timestep
        actions = agent.act(states=states)
        states, terminal, reward = environment.execute(actions=actions)
        agent.observe(terminal=terminal, reward=reward)

agent.close()
environment.close()
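
The same act-observe loop can alternatively be driven by Tensorforce's Runner utility, which also takes care of episode bookkeeping and progress reporting (a minimal sketch; see the runner documentation for the full set of arguments):

from tensorforce.execution import Runner

# Wraps the training loop above, including the reset/act/execute/observe calls
runner = Runner(agent=agent, environment=environment)
runner.run(num_episodes=300)
runner.close()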

Command line usage

Tensorforce comes with a range of example configurations for different popular reinforcement learning environments. For instance, to run Tensorforce's implementation of the popular Proximal Policy Optimization (PPO) algorithm on the OpenAI Gym CartPole environment, execute the following line:

python3 run.py --agent benchmarks/configs/ppo.json --environment gym \
    --level CartPole-v1 --episodes 100

For more information check out the documentation.

Features

  • Network layers: Fully-connected, 1- and 2-dimensional convolutions, embeddings, pooling, RNNs, dropout, normalization, and more; plus support for Keras layers.
  • Network architecture: Support for multi-state inputs and layer (block) reuse, simple definition of directed acyclic graph structures via register/retrieve layers, plus support for arbitrary architectures.
  • Memory types: Simple batch buffer memory, random replay memory.
  • Policy distributions: Bernoulli distribution for boolean actions, categorical distribution for (finite) integer actions, Gaussian distribution for continuous actions, Beta distribution for range-constrained continuous actions, multi-action support.
  • Reward estimation: Configuration options for estimation horizon, future reward discount, state/state-action/advantage estimation, and for whether to consider terminal and horizon states.
  • Training objectives: (Deterministic) policy gradient, state-(action-)value approximation.
  • Optimization algorithms: Various gradient-based optimizers provided by TensorFlow like Adam/AdaDelta/RMSProp/etc, evolutionary optimizer, natural-gradient-based optimizer, plus a range of meta-optimizers.
  • Exploration: Randomized actions, sampling temperature, variable noise.
  • Preprocessing: Clipping, deltafier, sequence, image processing.
  • Regularization: L2 and entropy regularization.
  • Execution modes: Parallelized execution of multiple environments based on Python's multiprocessing and socket.
  • Optimized act-only SavedModel extraction.
  • TensorBoard support.

By combining these modular components in different ways, a variety of popular deep reinforcement learning models/features can be replicated:

Note that, in general, the replication is not 100% faithful, since the models as described in the corresponding papers often involve additional minor tweaks and modifications which are hard to support with a modular design (and it is arguably questionable whether supporting them is important or desirable). On the upside, these models are just a few examples from the multitude of module combinations supported by Tensorforce; one such pre-configured combination is shown below.
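
For instance, rather than assembling the components by hand as in the quickstart, a pre-configured agent such as PPO can be instantiated directly from its agent key (a hedged sketch; the batch_size value is illustrative):

from tensorforce import Agent, Environment

environment = Environment.create(
    environment='gym', level='CartPole', max_episode_timesteps=500
)

# Pre-configured PPO agent; unspecified hyperparameters keep their defaults
agent = Agent.create(agent='ppo', environment=environment, batch_size=10)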

Environment adapters

  • Arcade Learning Environment, a simple object-oriented framework that allows researchers and hobbyists to develop AI agents for Atari 2600 games.
  • CARLA, an open-source simulator for autonomous driving research.
  • OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms which supports teaching agents everything from walking to playing games like Pong or Pinball.
  • OpenAI Retro, which lets you turn classic video games into Gym environments for reinforcement learning and comes with integrations for ~1000 games.
  • OpenSim, reinforcement learning with musculoskeletal models.
  • PyGame Learning Environment, a learning environment which allows a quick start to reinforcement learning in Python.
  • ViZDoom, which allows developing AI bots that play Doom using only visual information.

Support, feedback and donating

Please get in touch via mail or on Gitter if you have questions, feedback, ideas for features/collaboration, or if you seek support for applying Tensorforce to your problem.

If you want to support the Tensorforce core team (see below), please also consider donating: GitHub Sponsors or Liberapay.

Core team and contributors

Tensorforce is currently developed and maintained by Alexander Kuhnle.

Earlier versions of Tensorforce (<= 0.4.2) were developed by Michael Schaarschmidt, Alexander Kuhnle and Kai Fricke.

The advanced parallel execution functionality was originally contributed by Jean Rabault (@jerabaul29) and Vincent Belus (@vbelus). Moreover, the pretraining feature was largely developed in collaboration with Hongwei Tang (@thw1021) and Jean Rabault (@jerabaul29).

The CARLA environment wrapper is currently developed by Luca Anzalone (@luca96).

We are very grateful to our open-source contributors (listed according to GitHub, updated periodically):

Islandman93, sven1977, Mazecreator, wassname, lefnire, daggertye, trickmeyer, mkempers, mryellow, ImpulseAdventure, janislavjankov, andrewekhalel, HassamSheikh, skervim, beflix, coord-e, benelot, tms1337, vwxyzjn, erniejunior, Deathn0t, petrbel, nrhodes, batu, yellowbee686, tgianko, AdamStelmaszczyk, BorisSchaeling, christianhidber, Davidnet, ekerazha, gitter-badger, kborozdin, Kismuz, mannsi, milesmcc, nagachika, neitzal, ngoodger, perara, sohakes, tomhennigan.

Cite Tensorforce

Please cite the framework as follows:

@misc{tensorforce,
  author       = {Kuhnle, Alexander and Schaarschmidt, Michael and Fricke, Kai},
  title        = {Tensorforce: a TensorFlow library for applied reinforcement learning},
  howpublished = {Web page},
  url          = {https://github.com/tensorforce/tensorforce},
  year         = {2017}
}

If you use the parallel execution functionality, please additionally cite it as follows:

@article{rabault2019accelerating,
  title        = {Accelerating deep reinforcement learning strategies of flow control through a multi-environment approach},
  author       = {Rabault, Jean and Kuhnle, Alexander},
  journal      = {Physics of Fluids},
  volume       = {31},
  number       = {9},
  pages        = {094105},
  year         = {2019},
  publisher    = {AIP Publishing}
}

If you use Tensorforce in your research, you may additionally consider citing the following paper:

@article{lift-tensorforce,
  author       = {Schaarschmidt, Michael and Kuhnle, Alexander and Ellis, Ben and Fricke, Kai and Gessert, Felix and Yoneki, Eiko},
  title        = {{LIFT}: Reinforcement Learning in Computer Systems by Learning From Demonstrations},
  journal      = {CoRR},
  volume       = {abs/1808.07903},
  year         = {2018},
  url          = {http://arxiv.org/abs/1808.07903},
  archivePrefix = {arXiv},
  eprint       = {1808.07903}
}


tensorforce's Issues

Load and test a learned policy

Hi, I was wondering if there's currently a straightforward way to load a saved policy and run that policy with an environment without training updates, or do I have to write my own runner for this purpose? Thanks.
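
For reference, with the current API an agent saved during or after training can typically be restored and run purely for evaluation, without further updates, along the following lines (a hedged sketch; the directory, format, and the independent/deterministic act flags should be checked against the saving-and-loading documentation):

from tensorforce import Agent, Environment

environment = Environment.create(
    environment='gym', level='CartPole', max_episode_timesteps=500
)

# Restore a previously saved agent (directory and format are illustrative)
agent = Agent.load(directory='saved-model', format='checkpoint', environment=environment)

# Evaluation episode: independent=True means no experience is recorded, so no updates happen
states = environment.reset()
terminal = False
while not terminal:
    actions = agent.act(states=states, independent=True, deterministic=True)
    states, terminal, reward = environment.execute(actions=actions)

agent.close()
environment.close()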

Configs should not change when passing to an agent

Currently, a configuration contains additional default and internal values after the initialization of an agent. This should not be the case, instead the agent could, for instance, create a copy of the configuration before modification.

simple_q_agent example

When I try to run the simple_q_agent.py script I get the following error:

  File "/Users/aidanrocke/Desktop/open_ai_solutions/tensor_force/examples/simple_dqn.py", line 214, in main
    runner.run(max_episodes, max_timesteps, episode_finished=episode_finished)

  File "/Users/aidanrocke/tensorforce/tensorforce/execution/runner.py", line 58, in run
    action = self.agent.get_action(processed_state, self.episode)

  File "/Users/aidanrocke/tensorforce/tensorforce/agents/memory_agent.py", line 94, in get_action
    action = self.model.get_action(*args, **kwargs)

AttributeError: 'NoneType' object has no attribute 'get_action'

(Example of) support for multi-valued Box actions?

When trying to run the TRPO agent on BipedalWalker, as follows, I run into:

foo$ PYTHONPATH=. python examples/openai_gym.py BipedalWalker-v2 -D -a TRPOAgent -c examples/configs/trpo_agent.json -n examples/configs/trpo_network.json
....
File "/../tensorforce/tensorforce/environments/openai_gym.py", line 67, in execute
 state, reward, terminal, _ = self.gym.step(action)
File "/usr/local/lib/python2.7/dist-packages/gym/core.py", line 99, in step
 return self._step(action)
File "/usr/local/lib/python2.7/dist-packages/gym/wrappers/time_limit.py", line 36, in _step
 observation, reward, done, info = self.env.step(action)
File "/usr/local/lib/python2.7/dist-packages/gym/core.py", line 99, in step
 return self._step(action)
File "/usr/local/lib/python2.7/dist-packages/gym/envs/box2d/bipedal_walker.py", line 372, in _step
 self.joints[1].motorSpeed     = float(SPEED_KNEE    * np.sign(action[1]))
IndexError: list index out of range

Looking at OpenAIGym.actions, it doesn't seem to unravel that environment's Box(4) action space as wanted - am I just failing to configure the agent as required, or are such action spaces not handled right now?

Cannot install

on docker, this just hangs:

Step 7/8 : RUN pip install tensorforce[tf] -e .
 ---> Running in 55d5d05d7049
Obtaining file:///code/tensorforce

Gaussian distribution parameters ignored

If I create a distribution with Gaussian(distribution=(0, 0.1)), the parameters (0, 0.1) are ignored and instead the result from Gaussian.create_tf_operations is used. At the very least I would expect the parameters that I pass to Gaussian to be used as initial guesses for the parameterization.

In general the initial variance of the policy cannot be specified right now. In practice that's an important tuning parameter. The easiest way to do this might be to allow users to pass an instance of the distribution as part of the config, rather than the class.

Lastly, the sigmoid rescaling of the policy within Gaussian seems hacky. What if I already provide a custom network that has properly scaled actions? In that case I wouldn't want another sigmoid nonlinearity to be applied. I think this would better fit into the network_builder.

result logging and policy saving

Hi, maybe I'm missing something but where do you save the various training metrics (returns, entropy, etc) and is there a mechanism to save the trained model or do we have to implement that. Thanks!

Make Gaussian initial std configurable

From #26:

'Another thing I noticed in continuous state spaces is that the standard deviation of the Gaussian (exploration) noise is not parameterized. That seems like a bad default for this kind of on-policy method. It's an easy fix since the required code in the Gaussian class is just commented out, but enabling this does not seem possible without low-level adjustments at the moment.'

API: allow update from external batch

Agent API needs to allow to pass in a batch of experiences to update from - for use cases where data is collected in a way where passing it sample by sample to TensorForce isn't needed/creates too much I/O.

Min/max values for continuous actions

Currently it is possible to define min_value and max_value for continuous actions, but this value is never actually used. Part of the problem is that the so far only continuous distribution Gaussian does not naturally bound its possible samples.

Option for the experience-sampling strategy in Replay.get_batch

Currently, Replay.get_batch returns samples as one contiguous range of the original sequence of experiences. I'd like to get a batch whose samples are picked from memory at random, to reduce the bias from correlated samples. I would like to add an option to change the sampling strategy in Replay.get_batch, as illustrated below.

See #59
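
For illustration, the difference between the two strategies boils down to drawing indices uniformly at random instead of taking one contiguous slice (a standalone NumPy sketch, not Tensorforce code):

import numpy as np

memory = list(range(1000))  # stand-in for stored experiences
batch_size = 64

# Current behaviour: one contiguous slice of consecutive experiences
start = np.random.randint(0, len(memory) - batch_size)
sequential_batch = memory[start:start + batch_size]

# Proposed option: indices drawn uniformly at random, reducing correlation within the batch
indices = np.random.randint(0, len(memory), size=batch_size)
random_batch = [memory[i] for i in indices]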

Quick start example raises TypeError

Fresh install. Command from http://tensorforce.readthedocs.io/en/latest/#quick-start:

python examples/openai_gym.py CartPole-v0 -a TRPOAgent -c examples/configs/trpo_agent.json -n examples/configs/trpo_network.json

Gives:

[2017-07-24 22:51:58,560] Making new env: CartPole-v0
Traceback (most recent call last):
  File "examples/openai_gym.py", line 121, in <module>
    main()
  File "examples/openai_gym.py", line 70, in main
    agent = agents[args.agent](config=agent_config)
  File "/home/tensorforce/tensorforce/agents/batch_agent.py", line 50, in __init__
    super(BatchAgent, self).__init__(config)
  File "/home/tensorforce/tensorforce/agents/agent.py", line 143, in __init__
    self.model = self.__class__.model(config)
  File "/home/tensorforce/tensorforce/models/trpo_model.py", line 54, in __init__
    super(TRPOModel, self).__init__(config)
  File "/home/tensorforce/tensorforce/models/policy_gradient_model.py", line 81, in __init__
    self.baseline = Baseline.from_config(config=config.baseline)
  File "/home/tensorforce/tensorforce/core/baselines/baseline.py", line 43, in from_config
    predefined=tensorforce.core.baselines.baselines
  File "/home/tensorforce/tensorforce/util.py", line 123, in get_object
    return obj(**full_kwargs)
TypeError: __init__() takes at least 2 arguments (1 given)

obj from util.py:119 is <class 'tensorforce.core.baselines.mlp.MLPBaseline'>, kwargs is None and full_kwargs is {}.

MLPBaseline's __init__ indeed takes at least 2 arguments.

Clean up dtype configuration

Ideally, we would want to allow to specify float precisions everywhere. Currently, we only use this in a few classes and inconsistently.

Documentation for epsilon decay

It's not the linear decay based on the remaining timesteps that I was expecting.

self.epsilon -= ((self.epsilon - self.epsilon_final) / self.epsilon_timesteps) * timestep

So over 100 steps it takes about 30-40 steps to get "close" to epsilon_final.

There is potential for an optional decay mode.
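
To make the difference concrete, here is the quoted update next to the linear schedule the reporter expected (a standalone sketch, not the library code):

# Quoted decay: the decrement grows with the timestep, so epsilon approaches
# epsilon_final well before epsilon_timesteps is reached.
def quoted_decay(epsilon, epsilon_final, epsilon_timesteps, timestep):
    return epsilon - ((epsilon - epsilon_final) / epsilon_timesteps) * timestep

# Expected linear decay: interpolates from epsilon_initial to epsilon_final
# over exactly epsilon_timesteps steps, then stays constant.
def linear_decay(epsilon_initial, epsilon_final, epsilon_timesteps, timestep):
    fraction = min(timestep / epsilon_timesteps, 1.0)
    return epsilon_initial + fraction * (epsilon_final - epsilon_initial)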

Calling finalize() on the graph

The runner should probably call finalize on the graph, but if the runner is not used, we should also call finalize internally somewhere.

Incorrect number of columns computing lower triangular matrix in NAF agent

In naf_model.py, lines 71-79:

if num_actions > 1:
    offset = num_actions
    l_columns = list()
    for zeros, size in enumerate(xrange(num_actions - 1, 0, -1), 1):
        column = tf.pad(l_entries[:, offset: offset + size], ((0, 0), (zeros, 0)))
        l_columns.append(column)
        offset += size
    l_matrix += tf.stack(l_columns, 1)

I believe the number of columns given to tf.stack is incorrect (one too few). I think there needs to be an extra column, e.g. by adding something like:

l_columns.append(tf.zeros_like(l_columns[0]))

Is this correct?

The error I'm getting is:

ValueError: Dimensions must be equal, but are 59 and 58 for 'training_outputs/add' (op: 'Add') with input shapes: [?,59,59], [?,58,59].

from the line

l_matrix += tf.stack(l_columns, 1)
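
The dimension mismatch can be reproduced with a small standalone example: stacking only num_actions - 1 padded columns yields a (num_actions - 1) x num_actions tensor, so one extra all-zero column (as suggested above) is needed before it can be added to the square diagonal term (illustrative NumPy, not the library code):

import numpy as np

num_actions = 4
# Flat vector holding the num_actions * (num_actions - 1) / 2 off-diagonal entries
l_entries = np.arange(1, num_actions * (num_actions - 1) // 2 + 1, dtype=float)

offset = 0
l_columns = []
for zeros, size in enumerate(range(num_actions - 1, 0, -1), 1):
    # Pad each slice with leading zeros so every column has length num_actions
    column = np.pad(l_entries[offset: offset + size], (zeros, 0))
    l_columns.append(column)
    offset += size

print(np.stack(l_columns, 0).shape)  # (3, 4): one column short of square

# Suggested fix: append an all-zero column to obtain the square (4, 4) off-diagonal part
l_columns.append(np.zeros_like(l_columns[0]))
print(np.stack(l_columns, 0).shape)  # (4, 4)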

Investigate occasional NaN in TRPO

TRPO occasionally fails to produce a robust update, with the Lagrange multiplier being None; need to check whether the gradient computation can produce None.

Prioritized replay index out-of-range

Traceback (most recent call last):
  File "examples/openai_gym.py", line 121, in <module>
    main()
  File "examples/openai_gym.py", line 112, in main
    runner.run(args.episodes, args.max_timesteps, episode_finished=episode_finished)
  File "/home/yellow/work/tf/tensorforce/tensorforce/execution/runner.py", line 144, in run
    self.agent.observe(reward=reward, terminal=terminal)
  File "/home/yellow/work/tf/tensorforce/tensorforce/agents/dqn_agent.py", line 94, in observe
    super(DQNAgent, self).observe(reward=reward, terminal=terminal)
  File "/home/yellow/work/tf/tensorforce/tensorforce/agents/memory_agent.py", line 84, in observe
    internal=self.current_internal
  File "/home/yellow/work/tf/tensorforce/tensorforce/core/memories/prioritized_replay.py", line 55, in add_observation
    priority, _ = self.observations.pop(self.positive_priority_index)
IndexError: pop index out of range

setup.py and tensorflow with/without gpu

Hi,

Unless the goal is not to support tensorflow with gpu, I would recommend to move the tensorflow requirement to "extra_requires". I have seen this pattern in both sonnet and tensor2tensor.

For example:

from setuptools import setup

extra_packages = {
    'tensorflow': ['tensorflow>=1.0.1'],
    'tensorflow with gpu': ['tensorflow-gpu>=1.0.1']
}

install_requires = [
    'numpy',
    'six',
    'scipy',
    'pillow',
    'pytest'
]

setup_requires = ['numpy', 'recommonmark', 'mistune']

setup(
    name='tensorforce',
    version='0.2',
    description='Reinforcement learning for TensorFlow',
    url='http://github.com/reinforceio/tensorforce',
    author='reinforce.io',
    author_email='[email protected]',
    license='Apache 2.0',
    packages=['tensorforce'],
    install_requires=install_requires,
    extras_require=extra_packages,  # note: the setuptools keyword is extras_require
    setup_requires=setup_requires,
    zip_safe=False
)

Regards,

Pedro

PS Will spend my weekend understanding tensorforce. Great work!

Issues with multiple continuous actions

Hi,
first of all, thanks for the hard work that is going into this project. You are saving me a ton of work.
Second, I encountered some strange behavior when trying to define an agent with multiple continuous actions. All code below was run in a Jupyter notebook with Anaconda and Python 3.5:

#Configuration, adapted from config in readme
config = Configuration(
    batch_size=100,
    states=dict(shape=(4,), type='float'),
    actions=dict(opt_a = dict(continuous=True, min_value = 0, max_value = 2),
                opt_b = dict(continuous=True, min_value = 0, max_value = 2)),
    network=layered_network_builder([dict(type='dense', size=50), dict(type='dense', size=50)])
)

# Create a TRPO agent
agent = TRPOAgent(config=config)

This code crashes with the trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-70-b10cf4edc1d7> in <module>()
      1 # Create a VPGA agent
----> 2 agent = TRPOAgent(config=config)

/Users/jannes/AnacondaProjects/tensorforce/tensorforce/agents/batch_agent.py in __init__(self, config)
     48     def __init__(self, config):
     49         config.default(BatchAgent.default_config)
---> 50         super(BatchAgent, self).__init__(config)
     51         self.batch_size = config.batch_size
     52         self.batch = None

/Users/jannes/AnacondaProjects/tensorforce/tensorforce/agents/agent.py in __init__(self, config)
    141         self.actions_config = config.actions
    142 
--> 143         self.model = self.__class__.model(config)
    144 
    145         self.episode = 0

/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/trpo_model.py in __init__(self, config)
     52     def __init__(self, config):
     53         config.default(TRPOModel.default_config)
---> 54         super(TRPOModel, self).__init__(config)
     55 
     56         self.override_line_search = config.override_line_search

/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/policy_gradient_model.py in __init__(self, config)
     81             self.baseline = Baseline.from_config(config=config.baseline)
     82 
---> 83         super(PolicyGradientModel, self).__init__(config)
     84 
     85         # advantage estimation

/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/model.py in __init__(self, config)
    118                 scope = scope_context.__enter__()
    119 
--> 120             self.create_tf_operations(config)
    121 
    122             if config.distributed:

/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/trpo_model.py in create_tf_operations(self, config)
    117 
    118             gradients = tf.gradients(fixed_kl_divergence, variables)
--> 119             gradient_vector_product = [tf.reduce_sum(g * t) for (g, t) in zip(gradients, tangents)]
    120 
    121             self.flat_variable_helper = FlatVarHelper(variables)

/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/trpo_model.py in <listcomp>(.0)
    117 
    118             gradients = tf.gradients(fixed_kl_divergence, variables)
--> 119             gradient_vector_product = [tf.reduce_sum(g * t) for (g, t) in zip(gradients, tangents)]
    120 
    121             self.flat_variable_helper = FlatVarHelper(variables)

/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py in r_binary_op_wrapper(y, x)
    895   def r_binary_op_wrapper(y, x):
    896     with ops.name_scope(None, op_name, [x, y]) as name:
--> 897       x = ops.convert_to_tensor(x, dtype=y.dtype.base_dtype, name="x")
    898       return func(x, y, name=name)
    899 

/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, preferred_dtype)
    649       name=name,
    650       preferred_dtype=preferred_dtype,
--> 651       as_ref=False)
    652 
    653 

/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype)
    714 
    715         if ret is None:
--> 716           ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    717 
    718         if ret is NotImplemented:

/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    174                                          as_ref=False):
    175   _ = as_ref
--> 176   return constant(v, dtype=dtype, name=name)
    177 
    178 

/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name, verify_shape)
    163   tensor_value = attr_value_pb2.AttrValue()
    164   tensor_value.tensor.CopyFrom(
--> 165       tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
    166   dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)
    167   const_tensor = g.create_op(

/Users/jannes/anaconda/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape)
    358   else:
    359     if values is None:
--> 360       raise ValueError("None values not supported.")
    361     # if dtype is provided, forces numpy array to be the type
    362     # provided if possible.

ValueError: None values not supported.

I tried different agents and encountered another strange behavior:

# Create a VPG agent
agent = VPGAgent(config=config)
state = np.array([1,2,3,4])
agent.act(state)

Crashes with:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-73-565d0bd87882> in <module>()
----> 1 agent.act(state)

/Users/jannes/AnacondaProjects/tensorforce/tensorforce/agents/agent.py in act(self, state, deterministic)
    194 
    195         # model action
--> 196         self.current_action, self.next_internal = self.model.get_action(state=self.current_state, internal=self.current_internal, deterministic=deterministic)
    197 
    198         # exploration

/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/model.py in get_action(self, state, internal, deterministic)
    219         fetches.update({n: internal_output for n, internal_output in enumerate(self.internal_outputs)})
    220 
--> 221         feed_dict = {state_input: (state[name],) for name, state_input in self.state.items()}
    222         feed_dict.update({internal_input: (internal[n],) for n, internal_input in enumerate(self.internal_inputs)})
    223         feed_dict[self.deterministic] = deterministic

/Users/jannes/AnacondaProjects/tensorforce/tensorforce/models/model.py in <dictcomp>(.0)
    219         fetches.update({n: internal_output for n, internal_output in enumerate(self.internal_outputs)})
    220 
--> 221         feed_dict = {state_input: (state[name],) for name, state_input in self.state.items()}
    222         feed_dict.update({internal_input: (internal[n],) for n, internal_input in enumerate(self.internal_inputs)})
    223         feed_dict[self.deterministic] = deterministic

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

But when I redefine config, that is, I run

#Configuration, adapted from config in readme
config = Configuration(
    batch_size=100,
    states=dict(shape=(4,), type='float'),
    actions=dict(opt_a = dict(continuous=True, min_value = 0, max_value = 2),
                opt_b = dict(continuous=True, min_value = 0, max_value = 2)),
    network=layered_network_builder([dict(type='dense', size=50), dict(type='dense', size=50)])
)

again, it does not crash, but it occasionally outputs negative values for actions, although min_value = 0
{'opt_a': 0.28892395, 'opt_b': -0.10657883}
The PPO agent displays the same behavior as the VPG Agent.

I have tried this with many slightly different configurations; it seems to be a consistent issue.
Please let me know if you need any more code / info / data to reproduce the issue. Kindly, Jannes
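
For reference, in the current API the same setup would be written as a states/actions specification with explicit bounds for each named action (a hedged sketch; the agent type, batch_size, and shapes are illustrative):

from tensorforce import Agent

# Two named continuous actions, each bounded to [0, 2], mirroring the config above
agent = Agent.create(
    agent='ppo', batch_size=10, max_episode_timesteps=500,
    states=dict(type='float', shape=(4,)),
    actions=dict(
        opt_a=dict(type='float', shape=(), min_value=0.0, max_value=2.0),
        opt_b=dict(type='float', shape=(), min_value=0.0, max_value=2.0),
    ),
)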

Check state type in act()

Currently, not all iterables seem to work in agent.act(); e.g., a tuple is expected, and an nd-array of the correct shape can cause a TensorFlow freeze without any error message.

Act needs to either:

  • Check incoming state type and shape against the given state config and raise an Error
  • Convert other types to tuples

Error running the example

[egor@host tensorforce]$ python examples/openai_gym.py CartPole-v0 -a TRPOAgent -c examples/configs/trpo_cartpole.json -n examples/configs/trpo_cartpole_network.json -s /home/egor/Software/tensorforce/examples/output
[2017-07-19 00:35:06,206] Making new env: CartPole-v0
2017-07-19 00:35:06.922073: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 00:35:06.922107: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 00:35:06.922116: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 00:35:06.922128: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 00:35:06.922135: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
[2017-07-19 00:35:06,977] Starting TRPOAgent for Environment 'OpenAIGym(CartPole-v0)'
[2017-07-19 00:35:08,600] Finished episode 50 after 12 timesteps
[2017-07-19 00:35:08,600] Episode reward: 12.0
[2017-07-19 00:35:08,600] Average of last 500 rewards: 2.332
[2017-07-19 00:35:08,600] Average of last 100 rewards: 11.66
Saving agent after episode 100
Traceback (most recent call last):
  File "examples/openai_gym.py", line 121, in <module>
    main()
  File "examples/openai_gym.py", line 112, in main
    runner.run(args.episodes, args.max_timesteps, episode_finished=episode_finished)
  File "/home/egor/Software/tensorforce/tensorforce/execution/runner.py", line 158, in run
    self.agent.save_model(self.save_path)
  File "/home/egor/Software/tensorforce/tensorforce/agents/agent.py", line 238, in save_model
    self.model.save_model(path)
  File "/home/egor/Software/tensorforce/tensorforce/models/model.py", line 274, in save_model
    self.saver.save(self.session, path)
AttributeError: 'NoneType' object has no attribute 'save'

TRPO struggling with CartPole-v0 from quick start

After running python examples/quickstart.py (3000 episodes), the average reward from the last 100 episodes is only 33.38. I would expect it to be close to the maximum, 200, especially since it reached it a couple of times before, e.g. on episode 1469; however, it later deteriorates.

I also tried running it with provided command:

python examples/openai_gym.py CartPole-v0 -a TRPOAgent -c examples/configs/trpo_cartpole.json -n examples/configs/trpo_cartpole_network.json

However the results were also unsatisfactory:

[2017-07-24 23:58:58,363] Finished episode 4050 after 61 timesteps
[2017-07-24 23:58:58,363] Episode reward: 61.0
[2017-07-24 23:58:58,363] Average of last 500 rewards: 63.346
[2017-07-24 23:58:58,364] Average of last 100 rewards: 62.33
