
deep-reinforcement-learning-hands-on-second-edition's Introduction

Deep-Reinforcement-Learning-Hands-On-Second-Edition

Deep-Reinforcement-Learning-Hands-On-Second-Edition, published by Packt

Code branches

The repository is maintained to keep dependency versions up to date. Testing all the examples against new versions takes time and effort, so please be patient.

The logic is as follows: there are several code branches, each corresponding to the major PyTorch version the code was tested against. Due to incompatibilities between PyTorch versions and other components, the code in the printed book might differ from the code in the repo.

At the moment, there are the following branches available:

  • master: contains the code tested against the latest supported PyTorch version. At the moment, it is PyTorch 1.7.
  • torch-1.3-book: the code as printed in the book, with minor bug fixes. Uses PyTorch 1.3, which is available only from the conda repositories.
  • torch-1.7: PyTorch 1.7. This branch was tested and merged into master.

All branches use Python 3.7; more recent versions haven't been tested.

Dependencies installation

Anaconda is recommended for virtual environment creation. Once installed, the following steps will install everything needed:

  • change to the book repository directory: cd Deep-Reinforcement-Learning-Hands-On-Second-Edition
  • create a virtual environment: conda create -n rlbook python=3.7
  • activate it: conda activate rlbook
  • install PyTorch (adjust the cudatoolkit version to match your CUDA installation): conda install pytorch==1.7 torchvision torchaudio cudatoolkit=10.2 -c pytorch
  • install the rest of the dependencies: pip install -r requirements.txt

Now you're ready to launch and experiment with examples!
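
As a quick sanity check after installation, something like the following should run inside the activated environment (the printed version will depend on the branch you checked out):

import torch

print(torch.__version__)           # e.g. 1.7.x on the master branch
print(torch.cuda.is_available())   # True only if the installed cudatoolkit matches your driver/GPU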

deep-reinforcement-learning-hands-on-second-edition's People

Contributors

alexbmacleod, bezineb5, dependabot[bot], kishorrit, packt-itservice, shmuma, simondlevy


deep-reinforcement-learning-hands-on-second-edition's Issues

system requirements not stated

Please state in the README.md:

  • which platforms are supported (are they Linux, Windows, and Mac?)
  • which versions of python are supported

Chapter_3: 03_random_action_wrapper.py

Hi, I ran this code as-is from the file, but came across the error message below. Can someone help, please? I'm running Python 3.9.7 on Ubuntu 20.04 LTS with an Nvidia RTX 3080 GPU.

Error Message

usage: pydevconsole.py [-h] [--cuda]
pydevconsole.py: error: unrecognized arguments: --mode=client --port=34433
Process finished with exit code 2

Original Code from the file

import random
import argparse
import cv2

import torch
import torch.nn as nn
import torch.optim as optim
from tensorboardX import SummaryWriter

import torchvision.utils as vutils

import gym
import gym.spaces

import numpy as np

log = gym.logger
log.set_level(gym.logger.INFO)

LATENT_VECTOR_SIZE = 100
DISCR_FILTERS = 64
GENER_FILTERS = 64
BATCH_SIZE = 16

# dimension to which the input image will be rescaled
IMAGE_SIZE = 64

LEARNING_RATE = 0.0001
REPORT_EVERY_ITER = 100
SAVE_IMAGE_EVERY_ITER = 1000


class InputWrapper(gym.ObservationWrapper):
    """
    Preprocessing of input numpy array:
    1. resize image into predefined size
    2. move color channel axis to a first place
    """
    def __init__(self, *args):
        super(InputWrapper, self).__init__(*args)
        assert isinstance(self.observation_space, gym.spaces.Box)
        old_space = self.observation_space
        self.observation_space = gym.spaces.Box(
            self.observation(old_space.low),
            self.observation(old_space.high),
            dtype=np.float32)

    def observation(self, observation):
        # resize image
        new_obs = cv2.resize(
            observation, (IMAGE_SIZE, IMAGE_SIZE))
        # transform (210, 160, 3) -> (3, 210, 160)
        new_obs = np.moveaxis(new_obs, 2, 0)
        return new_obs.astype(np.float32)


class Discriminator(nn.Module):
    def __init__(self, input_shape):
        super(Discriminator, self).__init__()
        # this pipe converts the image into a single number
        self.conv_pipe = nn.Sequential(
            nn.Conv2d(in_channels=input_shape[0], out_channels=DISCR_FILTERS,
                      kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS, out_channels=DISCR_FILTERS*2,
                      kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS*2),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS * 2, out_channels=DISCR_FILTERS * 4,
                      kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS * 4),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS * 4, out_channels=DISCR_FILTERS * 8,
                      kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(DISCR_FILTERS * 8),
            nn.ReLU(),
            nn.Conv2d(in_channels=DISCR_FILTERS * 8, out_channels=1,
                      kernel_size=4, stride=1, padding=0),
            nn.Sigmoid()
        )

    def forward(self, x):
        conv_out = self.conv_pipe(x)
        return conv_out.view(-1, 1).squeeze(dim=1)


class Generator(nn.Module):
    def __init__(self, output_shape):
        super(Generator, self).__init__()
        # pipe deconvolves input vector into (3, 64, 64) image
        self.pipe = nn.Sequential(
            nn.ConvTranspose2d(in_channels=LATENT_VECTOR_SIZE, out_channels=GENER_FILTERS * 8,
                               kernel_size=4, stride=1, padding=0),
            nn.BatchNorm2d(GENER_FILTERS * 8),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS * 8, out_channels=GENER_FILTERS * 4,
                               kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS * 4),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS * 4, out_channels=GENER_FILTERS * 2,
                               kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS * 2),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS * 2, out_channels=GENER_FILTERS,
                               kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(GENER_FILTERS),
            nn.ReLU(),
            nn.ConvTranspose2d(in_channels=GENER_FILTERS, out_channels=output_shape[0],
                               kernel_size=4, stride=2, padding=1),
            nn.Tanh()
        )

    def forward(self, x):
        return self.pipe(x)


def iterate_batches(envs, batch_size=BATCH_SIZE):
    batch = [e.reset() for e in envs]
    env_gen = iter(lambda: random.choice(envs), None)

    while True:
        e = next(env_gen)
        obs, reward, is_done, _ = e.step(e.action_space.sample())
        if np.mean(obs) > 0.01:
            batch.append(obs)
        if len(batch) == batch_size:
            # Normalising input between -1 to 1
            batch_np = np.array(batch, dtype=np.float32) * 2.0 / 255.0 - 1.0
            yield torch.tensor(batch_np)
            batch.clear()
        if is_done:
            e.reset()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cuda", default=False, action='store_true',
        help="Enable cuda computation")
    args = parser.parse_args()

    device = torch.device("cuda" if args.cuda else "cpu")
    envs = [
        InputWrapper(gym.make(name))
        for name in ('Breakout-v0', 'AirRaid-v0', 'Pong-v0')
    ]
    input_shape = envs[0].observation_space.shape

    net_discr = Discriminator(input_shape=input_shape).to(device)
    net_gener = Generator(output_shape=input_shape).to(device)

    objective = nn.BCELoss()
    gen_optimizer = optim.Adam(
        params=net_gener.parameters(), lr=LEARNING_RATE,
        betas=(0.5, 0.999))
    dis_optimizer = optim.Adam(
        params=net_discr.parameters(), lr=LEARNING_RATE,
        betas=(0.5, 0.999))
    writer = SummaryWriter()

    gen_losses = []
    dis_losses = []
    iter_no = 0

    true_labels_v = torch.ones(BATCH_SIZE, device=device)
    fake_labels_v = torch.zeros(BATCH_SIZE, device=device)

    for batch_v in iterate_batches(envs):
        # fake samples, input is 4D: batch, filters, x, y
        gen_input_v = torch.FloatTensor(
            BATCH_SIZE, LATENT_VECTOR_SIZE, 1, 1)
        gen_input_v.normal_(0, 1)
        gen_input_v = gen_input_v.to(device)
        batch_v = batch_v.to(device)
        gen_output_v = net_gener(gen_input_v)

        # train discriminator
        dis_optimizer.zero_grad()
        dis_output_true_v = net_discr(batch_v)
        dis_output_fake_v = net_discr(gen_output_v.detach())
        dis_loss = objective(dis_output_true_v, true_labels_v) + \
                   objective(dis_output_fake_v, fake_labels_v)
        dis_loss.backward()
        dis_optimizer.step()
        dis_losses.append(dis_loss.item())

        # train generator
        gen_optimizer.zero_grad()
        dis_output_v = net_discr(gen_output_v)
        gen_loss_v = objective(dis_output_v, true_labels_v)
        gen_loss_v.backward()
        gen_optimizer.step()
        gen_losses.append(gen_loss_v.item())

        iter_no += 1
        if iter_no % REPORT_EVERY_ITER == 0:
            log.info("Iter %d: gen_loss=%.3e, dis_loss=%.3e",
                     iter_no, np.mean(gen_losses),
                     np.mean(dis_losses))
            writer.add_scalar(
                "gen_loss", np.mean(gen_losses), iter_no)
            writer.add_scalar(
                "dis_loss", np.mean(dis_losses), iter_no)
            gen_losses = []
            dis_losses = []
        if iter_no % SAVE_IMAGE_EVERY_ITER == 0:
            writer.add_image("fake", vutils.make_grid(
                gen_output_v.data[:64], normalize=True), iter_no)
            writer.add_image("real", vutils.make_grid(
                batch_v.data[:64], normalize=True), iter_no)

Chapter 4 - 01 cartpole.py: ValueError: probabilities do not sum to 1

Hi,

While trying the cartpole example from the cross-entropy chapter, I came across this issue in the code:

Traceback (most recent call last):
  File "frombook.py", line 91, in <module>
    env, net, BATCH_SIZE)):
  File "frombook.py", line 44, in iterate_batches
    action = np.random.choice(len(act_probs), p=act_probs)
  File "mtrand.pyx", line 932, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities do not sum to 1

I tried to normalize the probabilities:

        act_probs -= np.min(act_probs)
        act_probs /= np.sum(act_probs)

However, the probabilities are then just [1. 0.] or [0. 1.], and it then fails with this error:

Traceback (most recent call last):
  File "cartpole_crossentropy.py", line 109, in <module>
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE, device)):
  File "cartpole_crossentropy.py", line 48, in iterate_batches
    action = rng.choice(len(act_probs), p=act_probs)
  File "_generator.pyx", line 644, in numpy.random._generator.Generator.choice
ValueError: probabilities contain NaN

Could it be an initialization error?

FYI, I'm using:
torch==1.7.0
numpy==1.19.4

Thanks!
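
In case it helps anyone hitting the same error: np.random.choice checks that the probabilities sum to 1 within a tight tolerance, so a common workaround (a sketch, not the book's code) is to renormalize the softmax output in float64 before sampling:

import numpy as np

# softmax output from the net, slightly off from 1.0 due to float32 rounding
act_probs = np.array([0.4999999, 0.5000002], dtype=np.float32)

p = act_probs.astype(np.float64)
p /= p.sum()                      # renormalize so the sum is exactly 1.0 in float64
action = np.random.choice(len(p), p=p)
print(action)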

Readme Typo

I think you want to use the command "pip install -r requirements.txt" to set up the environment. I get an error when using "pip install requirements.txt"

Problem in all code execution for CH.8

Hello,
Loving the book, but in Ch. 8 I receive this error for all scripts:
Traceback (most recent call last):
File "C:/Users/hj/PycharmProjects/DeepLearning/01_dqn_basic.py", line 62, in
common.setup_ignite(engine, params, exp_source, NAME)
File "C:\Users\hj\PycharmProjects\DeepLearning\lib\common.py", line 158, in setup_ignite
handler.attach(engine)
File "C:\Users\hj\Anaconda3\envs\torch13\lib\site-packages\ptan\ignite.py", line 35, in attach
engine.register_events(*EpisodeEvents)
File "C:\Users\hj\Anaconda3\envs\torch13\lib\site-packages\ignite\engine\engine.py", line 223, in register_events
"Value at {} of event_names should be a str or EventEnum, but given {}".format(index, e)
TypeError: Value at 0 of event_names should be a str or EventEnum, but given EpisodeEvents.EPISODE_COMPLETED

I'm running torch 1.3, Ignite 0.4.2, and PTAN 0.6.

Chapter06, RuntimeError: gather_out_cpu(): Expected dtype int64 for index

The following error occurred when running $ python 02_dqn_pong.py:

...
9290: done 10 games, reward -20.200, eps 0.94, speed 651.18 f/s
Traceback (most recent call last):
  File ".\02_dqn_pong.py", line 189, in <module>
    loss_t = calc_loss(batch, net, tgt_net, device=device)
  File ".\02_dqn_pong.py", line 106, in calc_loss
    state_action_values = net(states_v).gather(
RuntimeError: gather_out_cpu(): Expected dtype int64 for index

So I changed the code as follows (see the actions_v line), and the error is gone.

def calc_loss(batch, net, tgt_net, device="cpu"):
    states, actions, rewards, dones, next_states = batch

    states_v = torch.tensor(np.array(states, copy=False)).to(device)
    next_states_v = torch.tensor(np.array(next_states, copy=False)).to(device)
    actions_v = torch.tensor(actions, dtype=torch.int64).to(device)  # action index must be int64
    rewards_v = torch.tensor(rewards).to(device)
    done_mask = torch.BoolTensor(dones).to(device)

    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    with torch.no_grad():
        next_state_values = tgt_net(next_states_v).max(1)[0]
        next_state_values[done_mask] = 0.0
        next_state_values = next_state_values.detach()

    expected_state_action_values = next_state_values * GAMMA + \
                                   rewards_v
    return nn.MSELoss()(state_action_values,
                        expected_state_action_values)

Just in case you run into the same error.

Chapter 14 cornell.py

Can you help me understand the line if movies and m_id not in movies:? Why does it check movies and then m_id not in movies?
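
For what it's worth, a minimal sketch of how that condition groups, with hypothetical values standing in for the data loaded in cornell.py:

# `and` binds looser than `not in`, so the condition reads as:
#     if movies and (m_id not in movies):
movies = {"m100", "m101"}   # hypothetical filter set; an empty set means "no filter"
m_id = "m999"

if movies and m_id not in movies:
    print("a filter is set and this movie id is not part of it")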

Reason for adding CNNs together in chapter 23?

Hi Max,

I own the first edition of your book, so maybe it's explained in the second edition, but since I couldn't find an explanation, I was wondering why you are adding the conv net outputs to the residuals (in Chapter 23: lib/model.py)?

        v = self.conv_in(x)
        v = v + self.conv_1(v)
        v = v + self.conv_2(v)
        v = v + self.conv_3(v)
        v = v + self.conv_4(v)
        v = v + self.conv_5(v)
        ......

Why exactly is this done? To what extent do the residual connections between conv nets add value? Do they significantly increase performance vs. a single CNN with more features? Asking because it takes a ton of time to run.

Thanks!
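
For context, here is a minimal sketch of a residual connection of this v = v + conv(v) form (the class name and layer sizes are illustrative, not taken from lib/model.py):

import torch
import torch.nn as nn

class ResidualConv(nn.Module):
    """y = x + conv(x): the conv branch only has to learn a correction to its input,
    and the skip connection keeps gradients flowing through deep stacks."""
    def __init__(self, channels):
        super(ResidualConv, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):
        return x + self.conv(x)

block = ResidualConv(64)
out = block(torch.randn(1, 64, 32, 32))   # same shape in and out: (1, 64, 32, 32)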

About the high-level RL libraries

As a beginner in RL, I don't think it is suitable to use PTAN in the book. A general library (e.g., RLlib or Stable Baselines3) may be better.

[Chp. 11/12] Q(s,a) calculation

Hi, I am reading the book but there is something not very clear in chapter 12, in particular:

I haven't fully understood the difference in how Q(s,a) is calculated in Chapter12/02_pong_a2c and Chapter11/05_pong_pg.
In the former we use a NN to predict V(s_N), while in the latter we simply compute the sum of discounted rewards for REWARD_STEPS steps. Why this difference? Maybe adding V(s) for the last state of the trajectory in the first case gives a more accurate estimation of Q(s,a), while in Chapter11/05_pong_pg we have simply truncated the sum of rewards?

Thanks
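
For reference, the two estimates being compared can be written as follows (standard formulas, not quotes from the book):

\text{Chapter 11 (policy gradient):}\quad Q(s_t, a_t) \approx \sum_{i=0}^{N-1} \gamma^{i} r_{t+i}

\text{Chapter 12 (A2C):}\quad Q(s_t, a_t) \approx \sum_{i=0}^{N-1} \gamma^{i} r_{t+i} + \gamma^{N} V(s_{t+N})

In other words, the A2C version bootstraps the truncated tail of the return with the critic's V(s_{t+N}) estimate, which is exactly the intuition in the question.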

Chapter 5

I am using the gymnasium library and the "FrozenLake-v1" environment.
I had to make some minor changes, like
new_state, reward, is_done, _, _ = self.env.step(action)

When running the "01_frozenlake_v_iteration" code in a Jupyter notebook I get the following error message:
TypeError Traceback (most recent call last)
Cell In[1], line 81
79 while True:
80 iter_no += 1
---> 81 agent.play_n_random_steps(100)
82 agent.value_iteration()
84 reward = 0.0

Cell In[1], line 25
23 action = self.env.action_space.sample()
24 new_state, reward, is_done, _, _ = self.env.step(action)
---> 25 self.rewards[(self.state, action, new_state)] = reward
26 self.transits[(self.state, action)][new_state] += 1
27 self.state = self.env.reset()
28 if is_done else new_state

TypeError: unhashable type: 'dict'

Any help will be much appreciated!
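
In case it is the same problem I hit: with gymnasium, env.reset() returns an (observation, info) tuple, so storing its raw return value ends up putting a dict inside the dictionary key. A minimal sketch of the adjustment, as a fragment following the chapter's Agent methods (not a complete script):

# wherever the environment is (re)started, unpack the (obs, info) tuple:
self.state, _ = self.env.reset()

# in play_n_random_steps, unpack the 5-tuple returned by step():
new_state, reward, is_done, is_trunc, _ = self.env.step(action)
self.rewards[(self.state, action, new_state)] = reward
self.transits[(self.state, action)][new_state] += 1
if is_done or is_trunc:
    self.state, _ = self.env.reset()
else:
    self.state = new_state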

Chapter 8 - 01_dqn_basic.py

When I run the code, I get the error below:

AttributeError: module 'ignite' has no attribute 'EndOfEpisodeHandler'

How do I solve this issue?

Chapter 3 - 03/04_atari_gan - incorrectly ported from the first edition

In addition to issue 15, gen_input_v initialization is corrupted:

written:

        gen_input_v.normal_(0, 1)

should be:

        gen_input_v = gen_input_v.normal_(0, 1)

As a side note, I find the following implementation more readable:

        gen_input_v = (
            torch.FloatTensor(BATCH_SIZE, LATENT_VECTOR_SIZE, 1, 1)
                .normal_(0, 1)
                .to(device))

(just add extra parentheses around the expression so it can continue on the next line).

URL for the book

I didn't find the book on Packt; I only found the first edition!

Chapter 10 Stocks Trading: Getting Error in "class StocksEnv(gym.Env):"

Hello,
I am trying to run the code as it is in

Chapter10/lib/environ.py

but I get an error


TypeError Traceback (most recent call last)
/var/folders/_s/1nlhtzmx7776vpsb0yjhzdq00000gp/T/ipykernel_85372/2865701658.py in <module>
----> 1 class StocksEnv(gym.Env):
2 metadata = {'render.modes': ['human']}
3 spec = EnvSpec("StocksEnv-v0")
4
5 def __init__(self, prices, bars_count=DEFAULT_BARS_COUNT,

/var/folders/_s/1nlhtzmx7776vpsb0yjhzdq00000gp/T/ipykernel_85372/2865701658.py in StocksEnv()
1 class StocksEnv(gym.Env):
2 metadata = {'render.modes': ['human']}
----> 3 spec = EnvSpec("StocksEnv-v0")
4
5 def __init__(self, prices, bars_count=DEFAULT_BARS_COUNT,

TypeError: __init__() missing 1 required positional argument: 'entry_point'

could you please help?

thanks

issue with ptan.experience.PrioritizedReplayBuffer

I tried to use ptan.experience.PrioritizedReplayBuffer and received an error (screenshot omitted).
This is how I implemented the changes which are necessary (screenshots omitted).
Then I tried to add the missing argument with buffer.beta and beta; neither of them worked (screenshot omitted).

Monitor Wrapper not available

Like much of the code in the first two chapters, the Monitor wrapper is no longer available. The code needs to be updated.

Here is an updated version.

import gymnasium as gym
from gymnasium.wrappers import RecordVideo


if __name__ == "__main__":
    env = gym.make("CartPole-v1", render_mode="rgb_array")
    env = RecordVideo(env, "records")

    total_reward = 0.0
    total_steps = 0
    obs, info = env.reset()

    while True:
        action = env.action_space.sample()
        obs, reward, done, truncated, info = env.step(action)
        total_reward += reward
        total_steps += 1
        if done or truncated:
            break

    print("Episode done in %d steps, total reward %.2f" % (
        total_steps, total_reward))
    env.close()

Chapter 4 Frozen Lake Non Slippery

This is the code that works for me. I also had to install PyGame and run conda install nomkl.

#!/usr/bin/env python3
import random
import gymnasium as gym
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.optim as optim

HIDDEN_SIZE = 128
BATCH_SIZE = 100
PERCENTILE = 30
GAMMA = 0.9


class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        self.observation_space = gym.spaces.Box(
            0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res


class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)


Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])


def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs, _ = env.reset()
    env.render()
    sm = nn.Softmax(dim=1)
    while True:
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _, _ = env.step(action)
        episode_reward += reward
        step = EpisodeStep(observation=obs, action=action)
        episode_steps.append(step)
        if is_done:
            e = Episode(reward=episode_reward, steps=episode_steps)
            batch.append(e)
            episode_reward = 0.0
            episode_steps = []
            next_obs, _ = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs


def filter_batch(batch, percentile):
    disc_rewards = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), batch))
    reward_bound = np.percentile(disc_rewards, percentile)

    train_obs = []
    train_act = []
    elite_batch = []
    for example, discounted_reward in zip(batch, disc_rewards):
        if discounted_reward > reward_bound:
            train_obs.extend(map(lambda step: step.observation, example.steps))
            train_act.extend(map(lambda step: step.action, example.steps))
            elite_batch.append(example)

    return elite_batch, train_obs, train_act, reward_bound


if __name__ == "__main__":
    random.seed(12345)
    # non-slippery FrozenLake: is_slippery=False is passed directly to gym.make
    env = gym.make("FrozenLake-v1", is_slippery=False, render_mode='human')
    env = gym.wrappers.TimeLimit(env, max_episode_steps=100)
    env = DiscreteOneHotWrapper(env)
    # env = gym.wrappers.Monitor(env, directory="mon", force=True)
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.001)
    writer = SummaryWriter(comment="-frozenlake-nonslippery")

    full_batch = []
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
        full_batch, obs, acts, reward_bound = filter_batch(full_batch + batch, PERCENTILE)
        if not full_batch:
            continue
        obs_v = torch.FloatTensor(obs)
        acts_v = torch.LongTensor(acts)
        full_batch = full_batch[-500:]

        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        optimizer.step()
        print("%d: loss=%.3f, reward_mean=%.3f, reward_bound=%.3f, batch=%d" % (
            iter_no, loss_v.item(), reward_mean, reward_bound, len(full_batch)))
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_mean", reward_mean, iter_no)
        writer.add_scalar("reward_bound", reward_bound, iter_no)
        if reward_mean > 0.8:
            print("Solved!")
            break
    writer.close()

Unable to Install PTAN

I've just started reading your book and it reads really well. I've finished Chapter 2 and am trying to install the prerequisites.
Everything is installed except PTAN, where I get this error:

Downloading install-1.3.4-py3-none-any.whl (3.1 kB)
ERROR: Could not find a version that satisfies the requirement torch==1.3.0 (from ptan==0.6) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2, 1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0) ERROR: No matching distribution found for torch==1.3.0 (from ptan==0.6)

Could you possibly upgrade the requirements?

Problem with ":" in file names on Windows machine in common.py

When running the Chapter 10 code I get the error seen below. It is because Windows doesn't allow ":" in file names.

Line 89 in common.py uses: datetime.now().isoformat(timespec='minutes')
and it returns: '2022-10-20T11:16'

I changed that line to the following, and it works: datetime.now().isoformat(timespec='minutes').replace(':', '')
It would be nice if this could be changed in the original code (or something else, if there is a better way).

Regards,
Peter

Reading data\YNDX_160101_161231.csv
Read done, got 131542 rows, 99752 filtered, 0 open prices adjusted
Reading data\YNDX_150101_151231.csv
Read done, got 130566 rows, 104412 filtered, 0 open prices adjusted
Traceback (most recent call last):

File "C:\Anaconda3\envs\rlbook\lib\site-packages\tensorboardX\record_writer.py", line 47, in directory_check
factory = REGISTERED_FACTORIES[prefix]

KeyError: 'runs/2022-10-20T11'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "C:\Anaconda3\envs\rlbook\lib\site-packages\spyder_kernels\py3compat.py", line 356, in compat_exec
exec(code, globals, locals)

File "c:\onedrive - it-resultat sverige ab\strategy development_rlbook\deep-reinforcement-learning-hands-on-second-edition\chapter10\train_model.py", line 109, in
extra_metrics=('values_mean',))

File "C:\OneDrive - IT-Resultat Sverige AB\Strategy Development_rlbook\Deep-Reinforcement-Learning-Hands-On-Second-Edition\Chapter10\lib\common.py", line 91, in setup_ignite
tb = tb_logger.TensorboardLogger(log_dir=logdir)

File "C:\Anaconda3\envs\rlbook\lib\site-packages\ignite\contrib\handlers\tensorboard_logger.py", line 161, in init
self.writer = SummaryWriter(*args, **kwargs)

File "C:\Anaconda3\envs\rlbook\lib\site-packages\tensorboardX\writer.py", line 301, in init
self._get_file_writer()

File "C:\Anaconda3\envs\rlbook\lib\site-packages\tensorboardX\writer.py", line 353, in _get_file_writer
**self.kwargs)

File "C:\Anaconda3\envs\rlbook\lib\site-packages\tensorboardX\writer.py", line 106, in init
logdir, max_queue, flush_secs, filename_suffix)

File "C:\Anaconda3\envs\rlbook\lib\site-packages\tensorboardX\event_file_writer.py", line 104, in init
directory_check(self._logdir)

File "C:\Anaconda3\envs\rlbook\lib\site-packages\tensorboardX\record_writer.py", line 51, in directory_check
os.makedirs(path)

File "C:\Anaconda3\envs\rlbook\lib\os.py", line 223, in makedirs
mkdir(name, mode)

NotADirectoryError: [WinError 267] Katalognamnet är felaktigt: 'runs/2022-10-20T11:11-simple-RUN'
(Translated from Swedish, the above row reads:
NotADirectoryError: [WinError 267] The directory name is invalid: 'runs/2022-10-20T11:11-simple-RUN')
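
A minimal sketch of the workaround described above (the strftime format is illustrative; the book's code builds the run name slightly differently):

from datetime import datetime

# avoid ":" in the run directory name so it is also valid on Windows
stamp = datetime.now().strftime("%Y-%m-%dT%H%M")
logdir = "runs/" + stamp + "-simple-RUN"
print(logdir)   # e.g. runs/2022-10-20T1116-simple-RUN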

[Question] How was the speed-up achieved in 02_n_envs.py from Chapter 9

My intuition was that the speed-up from using multiple environments simultaneously would come from sampling the environments in parallel, but from what I understood of the code, they are sampled sequentially: the environments are sampled alternately, but still not in parallel.

Is the code indeed sequential? And if it is, how is it achieving a speed up?

Errors during install on Windows

I tried to install on Windows with a new conda virtual env with python=3.6, using:
pip install -r requirements.txt

The following error resulted:

ERROR: Could not find a version that satisfies the requirement torch==1.3.0 (from -r requirements.txt (line 6)) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch==1.3.0 (from -r requirements.txt (line 6))

However, if I pre-installed pytorch 1.3 using:
conda install pytorch==1.3.0 torchvision cudatoolkit=10.1 -c pytorch
and then pip installed with the --user option: pip install --user -r requirements.txt, the install was successful.

FYI, on Linux, pip install -r requirements.txt worked correctly (the pre-install of PyTorch and the --user option were not needed).

Chapter 10

I receive the following error when I replace two linear layers with noisy linear layers (screenshot omitted):

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 512]] is at version 10005; expected version 10004 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Chapter03/04_atari_gan_ignite - Loss values order is wrong

Hello,

it seems the loss values are logged under the wrong names: RunningAverage binds to the wrong values.
Here, out[0] should be avg_loss_dis, and so out[1] should be avg_loss_gen:

RunningAverage(output_transform=lambda out: out[0]).\
    attach(engine, "avg_loss_gen")
RunningAverage(output_transform=lambda out: out[1]).\
    attach(engine, "avg_loss_dis")

The order returned by the training step, however, is dis_loss first.
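
If the training step indeed returns (dis_loss, gen_loss), a corrected binding would look like the sketch below (the placeholder losses are only there to make the snippet self-contained):

from ignite.engine import Engine
from ignite.metrics import RunningAverage

def process_batch(engine, batch):
    dis_loss, gen_loss = 0.5, 0.7      # placeholders; the real code returns the two loss values
    return dis_loss, gen_loss          # discriminator loss first, generator loss second

engine = Engine(process_batch)
RunningAverage(output_transform=lambda out: out[0]).attach(engine, "avg_loss_dis")
RunningAverage(output_transform=lambda out: out[1]).attach(engine, "avg_loss_gen")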

For CartPole in Chapter 4, this is the code that worked for me.

#!/usr/bin/env python3
import gymnasium as gym
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.optim as optim

HIDDEN_SIZE = 128
BATCH_SIZE = 16
PERCENTILE = 70


class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)


Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])


def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs, _ = env.reset()
    env.render()
    sm = nn.Softmax(dim=1)
    while True:
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _, _ = env.step(action)
        episode_reward += reward
        step = EpisodeStep(observation=obs, action=action)
        episode_steps.append(step)

        if is_done:
            e = Episode(reward=episode_reward, steps=episode_steps)
            batch.append(e)
            episode_reward = 0.0
            episode_steps = []
            next_obs, _ = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs


def filter_batch(batch, percentile):
    rewards = list(map(lambda s: s.reward, batch))
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))

    train_obs = []
    train_act = []
    for reward, steps in batch:
        if reward < reward_bound:
            continue
        train_obs.extend(map(lambda step: step.observation, steps))
        train_act.extend(map(lambda step: step.action, steps))

    train_obs_v = torch.FloatTensor(train_obs)
    train_act_v = torch.LongTensor(train_act)
    return train_obs_v, train_act_v, reward_bound, reward_mean


if __name__ == "__main__":
    env = gym.make("CartPole-v1", render_mode='human')
    # env = gym.wrappers.Monitor(env, directory="mon", force=True)
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.01)
    writer = SummaryWriter(comment="-cartpole")

    for iter_no, batch in enumerate(iterate_batches(
            env, net, BATCH_SIZE)):
        obs_v, acts_v, reward_b, reward_m = \
            filter_batch(batch, PERCENTILE)
        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        optimizer.step()
        print("%d: loss=%.3f, reward_mean=%.1f, rw_bound=%.1f" % (
            iter_no, loss_v.item(), reward_m, reward_b))
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_bound", reward_b, iter_no)
        writer.add_scalar("reward_mean", reward_m, iter_no)
        if reward_m > 199:
            print("Solved!")
            break
    writer.close()


Chapter08 - engine.run throws an error when the input data is an iterator

engine.run(common.batch_generator(buffer, params.replay_initial,

ValueError: Argument epoch_length should be defined if data is an iterator

It seems that both epoch_length and max_epochs should be defined. According to other issues, this seems to be because the version of pytorch-ignite I installed does not match the requirements.

However, I am not sure what values I should assign to these two parameters. For now, I just assign a really big number to max_epochs and the batch size to epoch_length.
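
For illustration, this is how those arguments can be passed explicitly, reusing the names from the call above (the values are arbitrary placeholders, not recommendations):

# When `data` is a plain generator, recent ignite versions need an explicit epoch_length;
# max_epochs then bounds the total number of epochs that can be run.
engine.run(
    common.batch_generator(buffer, params.replay_initial, params.batch_size),
    max_epochs=1000000,   # effectively "run until a handler terminates training"
    epoch_length=100,     # number of iterations counted as one epoch
)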

How to adapt validation_run from Ch10 for Categorical DQN?

Hello, I'm trying to adapt the examples from Ch 8 & 10 from the book into a Double-Dueling Categorical architecture using Conv1d from Ch 10. Training seems to work fine using ptan and Pytorch ignite. I want to run validation though using openai gym, so I was wondering how to determine the next action for a new observation batch. My understanding is that for the normal dueling/double Q Conv1d network we run a forward pass of the observation through the trained network for the Q values, which we maximize to find action_idx. When running an observation through the categorical architecture however the book states a forward pass "returns the predicted probability distribution as a 3D tensor (batch, actions, and supports)." For a bar size of 10 I see clearly in my output that I get a (1,3,51) shaped tensor. But dim=1 looks to be various weights, not integers. What additional steps do I need to take in order to get the next step to take for the openai gym? Thanks in advance, and happy to post more code if needed.

My model:

class PlatformDQNDistr(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(PlatformDQNDistr, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv1d(input_shape[0], 128, 5),
            nn.ReLU(),
            nn.Conv1d(128, 128, 5),
            nn.ReLU(),
        )
        conv_out_size = self._get_conv_out(input_shape)

        # We use Noisy networks rather than epsilon greedy action selection for exploration
        self.fc_val = nn.Sequential(
            NoisyFactorizedLinear(conv_out_size, 512),
            nn.ReLU(),
            NoisyFactorizedLinear(512, 1)
        )

        self.fc_adv = nn.Sequential(
            NoisyFactorizedLinear(conv_out_size, 512),
            nn.ReLU(),
            NoisyFactorizedLinear(512, n_actions * N_ATOMS)
        )
        sups = torch.arange(Vmin, Vmax + DELTA_Z, DELTA_Z)
        self.register_buffer("supports", sups)
        self.softmax = nn.Softmax(dim=1)

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        batch_size = x.size()[0]
        conv_out = self.conv(x).view(batch_size, -1)  # convolve batch
        val = self.fc_val(conv_out)
        adv = self.fc_adv(conv_out)
        return (val + adv - adv.mean(dim=1, keepdim=True)).view(batch_size, -1, N_ATOMS)

    def both(self, x):
        cat_out = self(x)
        probs_distribution = self.apply_softmax(cat_out)
        weights = probs_distribution * self.supports
        res = weights.sum(dim=2)
        return cat_out, res

    def q_vals(self, x):
        return self.both(x)[1]

    def apply_softmax(self, t):
        return self.softmax(t.view(-1, N_ATOMS)).view(t.size())
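
As for getting the next action during validation, here is a minimal sketch of how the q_vals helper above could be used (obs_v is assumed to be an already preprocessed observation batch; this is not taken from the book's validation code):

import torch

# net: a trained PlatformDQNDistr, obs_v: tensor shaped (batch, channels, bars)
with torch.no_grad():
    q_v = net.q_vals(obs_v)             # (batch, n_actions): distributions collapsed to expected values
    actions = torch.argmax(q_v, dim=1)  # greedy action index per observation
action_idx = int(actions[0].item())     # value to pass into env.step(...)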

BATCH_MUL in Chapter09/03_parallel

Hello, really enjoying this second edition! I'm curious what this BATCH_MUL variable is for here.


I see that you're sampling a larger batch, adjusting the epsilon tracker with it:

epsilon_tracker.frame(frame_idx/BATCH_MUL)

And setting the size of the Queue:

exp_queue = mp.Queue(maxsize=BATCH_MUL*2)

The FPS appeared low in my TensorBoard logs, so I am not sure if this is what you intended. While I am seeing a speed-up, it isn't quite as drastic as yours. It appears that there is still only a single env generating experience, but BATCH_MUL would make more sense to me if there were more than one. I'm planning to dig in deeper, but just wanted to hear your thoughts on that. Thanks!

Bug Typo Chapter21 atari_ppo.py line 116 && 118

The code in question is here: https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition/blob/master/Chapter21/atari_ppo.py

The issue is that a list called envs is created and filled with env, which is to be passed off to the exp_source:

envs = []
for _ in range(N_ENVS):
    env = atari_wrappers.make_atari(params.env_name, skip_noop=True, skip_maxskip=True)
    env = atari_wrappers.wrap_deepmind(env, pytorch_img=True, frame_stack=True)
    if do_distill:
        env = common.NetworkDistillationRewardWrapper(
            env, reward_callable=get_distill_reward,
            reward_scale=params.distill_scale, sum_rewards=False)
    envs.append(env)

But then, when the list envs is supposed to be passed off to the exp_source on lines 116 && 118, there's a typo and env is passed instead:

if do_distill:
    exp_source = common.DistillExperienceSource(env, agent, steps_count=1)
else:
    exp_source = ptan.experience.ExperienceSource(env, agent, steps_count=1)

Code should instead read:

if do_distill:
    exp_source = common.DistillExperienceSource(envs, agent, steps_count=1)
else:
    exp_source = ptan.experience.ExperienceSource(envs, agent, steps_count=1)

Value of Discount Factor in ch8/02_dqn_n_steps.py

In Chapter 8,
implementing the n-step DQN, the default value of gamma is 0.96059601 (0.99 ** 4). The same value is passed to the calc_loss_dqn function.

If we use steps_count = 4, shouldn't the bellman update (after unrolling) be something like:

bellman_vals = next_state_vals.detach() * (gamma ** 4) + 3rd_next_state_vals.detach() * (gamma ** 3) + 2nd... + rewards_v
but instead, as per the calc_loss_dqn function in the common.py module, it is:

bellman_vals = next_state_vals.detach() * gamma + rewards_v

In Chapter 7, it is stated that "there is an implementation of subtrajectory rollouts with accumulation of the reward." But is this accumulation of reward discounted? I don't think so, since no gamma parameter is passed.
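
For reference, the n-step Bellman target under discussion can be written as (a standard formula, not a quote from the book):

Q_{\text{target}}(s_t, a_t) = \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} \max_{a} Q(s_{t+n}, a)

If the experience source already accumulates the discounted sum of the n rewards into rewards_v, then multiplying only the bootstrapped term by gamma ** n (passed in as the single gamma argument) reproduces this target.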

Issues with installation instructions

pip install requirements.txt is missing the -r; it should read pip install -r requirements.txt.

Then, I received the following installation errors:

ERROR: Cannot install -r requirements.txt (line 10), -r requirements.txt (line 13), -r requirements.txt (line 4), -r requirements.txt (line 5), -r requirements.txt (line 9), atari-py==0.2.6, gym==0.17.3 and numpy==1.19.2 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested numpy==1.19.2
atari-py 0.2.6 depends on numpy
gym 0.17.3 depends on numpy>=1.10.4
opencv-python 4.4.0.46 depends on numpy>=1.14.5
ptan 0.7 depends on numpy
tensorboard 2.4.0 depends on numpy>=1.12.0
tensorboardx 2.1 depends on numpy
tensorflow 2.3.1 depends on numpy<1.19.0 and >=1.16.0

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

Document the Kaitai Struct specs better and send them to Kaitai Struct Formats if it makes any sense

Hi. I have noticed some .ksy specs in your repository implementing the messages of some protocol.

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition/tree/master/Chapter16/ksy

Could you document them better?

Fill in the metadata block with all the needed metainfo (the full title of the protocol, xrefs to its identifiers in other places, doc-refs to docs), and write a short summary doc answering the following questions:

  • are they implementing something that is a de-facto standard or just has a widespread use in some area?
  • how can a sample blob be acquired (a link, or a script generating it)?

If it is something community can benefit from, could you send the specs to Kaitai Struct Formats repo?

Chapter 9 ignite.engine dependency issue

I installed pytorch-ignite using pip, following the pytorch-ignite GitHub instructions.

However, I got an error:

$ python3 baseline.py --cuda
Traceback (most recent call last):
  File "baseline.py", line 98, in <module>
    engine.run(batch_generator(buffer, params.replay_initial, params.batch_size))
  File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 590, in run
    raise ValueError("Argument `epoch_length` should be defined if `data` is an iterator")
ValueError: Argument `epoch_length` should be defined if `data` is an iterator

After I commented out lines 589 and 590 of ignite's engine.py, baseline.py works.

Which version of ignite do I have to use?

I tried pytorch-ignite 0.3.0 and 0.4.0, but neither worked without commenting out a few lines of engine.py.

In 0.4.0 I commented out lines 589 and 590, and in 0.3.0 lines 834 and 835.

pong_A2C might backpropagate the policy loss into the value net

I was reviewing this code and thought it is possible that the policy loss might affect the value branch of the net.
If you look at line 143, which is adv_v = vals_ref_v - value_v.detach() and computes the advantage, value_v is detached to prevent the policy loss from affecting the value net during the backward pass. But if you consider how vals_ref_v is computed by the unpack_batch function, you will find at line 90, last_vals_v = net(last_states_v)[1], that the value net is involved in computing vals_ref_v.
As a result, I think line 143 should be changed from adv_v = vals_ref_v - value_v.detach() to adv_v = (vals_ref_v - value_v).detach()
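
A minimal sketch of the difference between the two expressions, with toy tensors standing in for the real values (in the actual code, vals_ref_v comes from unpack_batch):

import torch

value_v = torch.tensor([1.0], requires_grad=True)   # stand-in for the value head output
vals_ref_v = 0.9 * value_v + 0.5                    # stand-in: reference values that also depend on the net

adv_a = vals_ref_v - value_v.detach()    # still carries a gradient path through vals_ref_v
adv_b = (vals_ref_v - value_v).detach()  # fully detached: no gradient flows back at all

print(adv_a.requires_grad, adv_b.requires_grad)   # True False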

Problems with chapter 6 DQN code

Hello, I ran the code on Windows in a Jupyter notebook and got this error message:

Error message

TypeError                                 Traceback (most recent call last)
<ipython-input-6-ebd6b7e4996f> in <module>
     61         iter_no += 1
     62         s, a, r, next_s = agent.sample_env()
---> 63         agent.value_update(s, a, r, next_s)
     64 
     65         reward = 0.0

<ipython-input-6-ebd6b7e4996f> in value_update(self, s, a, r, next_s)
     35         best_v, _ = self.best_value_and_action(next_s)
     36         new_v = r + GAMMA * best_v
---> 37         old_v = self.values[(s, a)]
     38         self.values[(s, a)] = old_v * (1-ALPHA) + new_v * ALPHA
     39 

TypeError: unhashable type: 'dict'

The original code:

#!/usr/bin/env python3
import gym
import collections
from tensorboardX import SummaryWriter

ENV_NAME = "FrozenLake-v1"
GAMMA = 0.9
ALPHA = 0.2
TEST_EPISODES = 20


class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME)
        self.state = self.env.reset()
        self.values = collections.defaultdict(float)

    def sample_env(self):
        action = self.env.action_space.sample()
        old_state = self.state
        new_state, reward, is_done, _ = self.env.step(action)
        self.state = self.env.reset() if is_done else new_state
        return old_state, action, reward, new_state

    def best_value_and_action(self, state):
        best_value, best_action = None, None
        for action in range(self.env.action_space.n):
            action_value = self.values[(state, action)]
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_value, best_action

    def value_update(self, s, a, r, next_s):
        best_v, _ = self.best_value_and_action(next_s)
        new_v = r + GAMMA * best_v
        old_v = self.values[(s, a)]
        self.values[(s, a)] = old_v * (1-ALPHA) + new_v * ALPHA

    def play_episode(self, env):
        total_reward = 0.0
        state = env.reset()
        while True:
            _, action = self.best_value_and_action(state)
            new_state, reward, is_done, _ = env.step(action)
            total_reward += reward
            if is_done:
                break
            state = new_state
        return total_reward


if __name__ == "__main__":
    test_env = gym.make(ENV_NAME)
    agent = Agent()
    writer = SummaryWriter(comment="-q-learning")

    iter_no = 0
    best_reward = 0.0
    while True:
        iter_no += 1
        s, a, r, next_s = agent.sample_env()
        agent.value_update(s, a, r, next_s)

        reward = 0.0
        for _ in range(TEST_EPISODES):
            reward += agent.play_episode(test_env)
        reward /= TEST_EPISODES
        writer.add_scalar("reward", reward, iter_no)
        if reward > best_reward:
            print("Best reward updated %.3f -> %.3f" % (
                best_reward, reward))
            best_reward = reward
        if reward > 0.80:
            print("Solved in %d iterations!" % iter_no)
            break
    writer.close()
