udacity / deep-reinforcement-learning

Repo for the Deep Reinforcement Learning Nanodegree program

Home Page: https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893

License: MIT License

TeX 0.77% Jupyter Notebook 85.63% Python 13.60%

deep-reinforcement-learning reinforcement-learning reinforcement-learning-algorithms neural-networks pytorch pytorch-rl ddpg dqn ppo dynamic-programming

deep-reinforcement-learning's Issues

low computation resource utilization

I ran REINFORCE on my lab's server, which has an RTX 3090.
Surprisingly, GPU usage is only about 30 percent when running REINFORCE on one card, and CPU usage never exceeds 80%.
Can you tell me what the real problem is? Is it a limitation of REINFORCE, or is there something wrong with my code?
I would really appreciate help figuring out what causes this. Thank you in advance.

README doesn't reference required Visual Studio Build Tools for Windows

The OpenAI Gym installation instructions do not mention the "Build Tools for Visual Studio 2019", available from the following site.

https://visualstudio.microsoft.com/downloads/

I also found this by reading the following article.
https://towardsdatascience.com/how-to-install-openai-gym-in-a-windows-environment-338969e24d30

Even though this is an issue in the OpenAI gym, a note in this README would be very helpful.

openai/gym#11

About Crawler for Continuous Control

I am trying to solve the Crawler environment in the Continuous Control task. I have read the Unity webpage and realized that there are two environments: one with a static target and one with a dynamic target. Which one is provided through the links? Thank you!

"xdpyinfo" error in Discretization.ipynb

Running the first code-block in Discretization.ipynb gives:

xdpyinfo was not found, X start can not be checked! Please install xdpyinfo!

This doesn't stop the rest of the notebook from executing, but it is misleading and should be fixable with a few config commands.

Is there an estimation as to when PPO will be finished?

I ask because I tried taking the course last year, but much of it wasn't complete yet, and I have been waiting specifically for PPO. Thank you!

TypeError when passing device

I'm not sure why this is happening. I've printed the type of device and find that it's of class torch.device. The relevant code is below (unchanged from Udacity's version aside from the print statement) and the error is below that. Could it be an issue with ML-Agents?

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(type(device))

class Agent():
    """Interacts with and learns from the environment."""
    
    def __init__(self, state_size, action_size, random_seed):
        """Initialize an Agent object.
        
        Params
        ======
            state_size (int): dimension of each state
            action_size (int): dimension of each action
            random_seed (int): random seed
        """
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(random_seed)
        
        # Actor Network (w/ Target Network)
        self.actor_local = Actor(state_size, action_size, random_seed).to(device)
        self.actor_target = Actor(state_size, action_size, random_seed).to(device)
        self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=LR_ACTOR)

Traceback (most recent call last):
  File "train.py", line 87, in <module>
    agent_1 = Agent(state_size=48, action_size=action_size, random_seed=0)
  File "C:\Users\Tester\ml-agents\ml-agents\mlagents\trainers\ddpg\ddpg_agent.py", line 39, in __init__
    self.actor_local = Actor(state_size, action_size, random_seed).to(device)
  File "C:\Users\Tester\ml-agents\ml-agents\mlagents\trainers\ddpg\model.py", line 29, in __init__
    self.fc3 = nn.Linear(fc2_units, action_size)
  File "C:\Users\Tester\AppData\Local\conda\conda\envs\ml-agents\lib\site-packages\torch\nn\modules\linear.py", line 51, in __init__
    self.weight = Parameter(torch.Tensor(out_features, in_features))
TypeError: new() received an invalid combination of arguments - got (google.protobuf.pyext._message.RepeatedScalarContainer, int), but expected one of:
 * (torch.device device)
 * (torch.Storage storage)
 * (Tensor other)
 * (tuple of ints size, torch.device device)
      didn't match because some of the arguments have invalid types: (google.protobuf.pyext._message.RepeatedScalarContainer, int)
 * (object data, torch.device device)
      didn't match because some of the arguments have invalid types: (google.protobuf.pyext._message.RepeatedScalarContainer, int)
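
The traceback shows nn.Linear receiving a google.protobuf RepeatedScalarContainer where it expects an int for action_size, which suggests the value was read from a repeated protobuf field and passed on unconverted. A minimal sketch of a defensive conversion follows, assuming the container holds exactly one entry. The helper name to_int_size is hypothetical and is not part of PyTorch or ML-Agents, and a plain list stands in for the protobuf container.

```python
def to_int_size(size):
    """Return a plain int from an int or a one-element sequence-like container."""
    if isinstance(size, int):
        return size
    # list() accepts any iterable, so it also covers sequence-like wrappers.
    values = list(size)
    if len(values) != 1:
        raise ValueError(f"expected exactly one size entry, got {values}")
    return int(values[0])


# A list stands in here for the repeated field seen in the traceback.
action_size = to_int_size([2])
print(type(action_size).__name__, action_size)
```

With a conversion like this applied before constructing the Agent, the Actor would receive an int for action_size. Whether a single entry is the right interpretation for a given environment is an assumption to check against the brain parameters.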

Calculate correctly the fan-in for DDPG model

The model computes fan_in = layer.weight.data.size()[0]. This is wrong, because fan-in is the number of input units to the layer. The weight matrix of a linear layer is stored transposed (!), with shape (out_features, in_features), so we need the second component of the size, i.e. fan_in = layer.weight.data.size()[1]

See example of correct implementation using fan-in here: https://pytorch.org/docs/stable/_modules/torch/nn/init.html#kaiming_normal_
specifically def _calculate_fan_in_and_fan_out(tensor)
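
A minimal sketch of the corrected calculation, written against a bare shape tuple so it runs without PyTorch. It assumes the torch.nn.Linear weight layout (out_features, in_features) and a uniform bound of 1/sqrt(fan_in); the function name hidden_init_limits is hypothetical.

```python
import math


def hidden_init_limits(weight_shape):
    """Uniform init bounds (-lim, lim) with lim = 1/sqrt(fan_in).

    weight_shape follows the torch.nn.Linear layout: (out_features, in_features).
    """
    out_features, in_features = weight_shape
    fan_in = in_features  # second component of the shape, not the first
    lim = 1.0 / math.sqrt(fan_in)
    return -lim, lim


# A layer mapping 400 inputs to 300 outputs has weight shape (300, 400),
# so fan_in is 400 rather than 300.
low, high = hidden_init_limits((300, 400))
print(low, high)
```

Using size()[0] instead would base the bound on the number of output units, which only coincides with fan-in for square layers.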

DDPG pendulum action scaling

Hello

In the file: deep-reinforcement-learning/ddpg-pendulum/DDPG.ipynb

In the Pendulum-v0 environment, the actions are in the range from -2.0 to +2.0.
Hence, actions must be scaled before being passed to the environment (the actor produces tanh values from -1 to 1, which prevents the agent from using the full action range)

action = agent.act(state)
action *= 2.0  # add this line of code
next_state, reward, done, _ = env.step(action)

After I added this scaling factor, the agent converged much faster and reached a much better result.

Moreover, I used simpler networks with only 32 and 128 units instead of 400 and 300.

Best Regards
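
The multiplication by 2.0 works because Pendulum's bounds are symmetric around zero. A slightly more general sketch maps [-1, 1] linearly onto arbitrary [low, high] bounds. The function name scale_action is hypothetical, and the bounds would typically come from the environment's action space.

```python
def scale_action(action, low, high):
    """Linearly map each action component from [-1, 1] to [low, high]."""
    return [low + (a + 1.0) * 0.5 * (high - low) for a in action]


# For symmetric bounds such as -2.0 and 2.0 this reduces to multiplying by 2.0.
print(scale_action([-1.0, 0.0, 1.0], -2.0, 2.0))
```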

Unable to use unityagents 0.4.0 with MLAgents 0.8.1

vector_observations errors in the soccer2 environment

Recently, I was doing some RL experiments on the soccer2 environment, following https://github.com/udacity/deep-reinforcement-learning/blob/master/p3_collab-compet/Soccer.ipynb. However, the agents couldn't learn successfully, and I found some errors in the vector_observations of the soccer2 environment.
An agent on the red team is expected to detect objects such as "ball", "redGoal", "blueGoal", "wall", "redAgent" and "blueAgent". However, when I print the vector_observations, the agents never detect "redGoal" or "blueGoal": these values are always 0. I would appreciate it if you could fix this error and add this information. It would help me a lot.
Thanks.

Use Pseudocount of Ones to Avoid Divide by Zero

In the Monte Carlo Solution notebook and the assignment notebook, the count dictionary (N) uses a default value of zeros. Since not every action in a given state gets updated (especially in my case of using First-Visit MC Prediction), it is better to use a default value of ones.

To replicate the issue:

$ python
Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.array([2, 0]) / np.array([1, 0])
__main__:1: RuntimeWarning: invalid value encountered in true_divide
array([ 2., nan])

Thinking of replacing this:

N = defaultdict(lambda: np.zeros(env.action_space.n))

with this:

N = defaultdict(lambda: np.ones(env.action_space.n))

This implementation is from the cheatsheet. The textbook has no issue as it only mentions average(Returns()).
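
One thing to keep in mind is that starting the counts at one turns each estimate into sum/(n+1) rather than sum/n. An alternative that avoids the division by zero without a pseudocount is an incremental mean, where the count is incremented before it is used as a divisor. The following is a minimal sketch with plain lists; the function name update_action_value and the two-action setup are illustrative assumptions, not code from the notebook.

```python
from collections import defaultdict

N_ACTIONS = 2  # illustrative; the notebook would use env.action_space.n

# Counts and value estimates both start at zero.
N = defaultdict(lambda: [0] * N_ACTIONS)
Q = defaultdict(lambda: [0.0] * N_ACTIONS)


def update_action_value(Q, N, state, action, G):
    """Fold the return G into the running mean for (state, action)."""
    N[state][action] += 1  # the count is at least 1 before dividing
    Q[state][action] += (G - Q[state][action]) / N[state][action]


update_action_value(Q, N, "s0", 0, 2.0)
update_action_value(Q, N, "s0", 0, 4.0)
print(Q["s0"], N["s0"])
```

Actions that were never visited keep their initial estimate of 0.0 and are never divided by their zero count.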

DQL size/reshape error

Hi.

I am new to DQL. I am working with the AirSim simulator, and I wrote an algorithm in Python in Visual Studio, using Keras, to teach the drone to avoid obstacles. When I launch training, the algorithm seems to work normally at the beginning, but after iteration 400, 1300 or 2308 (it changes every time) the following error appears.

I use the reshape function in only two functions:

Here is my full code.
import numpy as np
import airsim
import time
import math
import tensorflow as tf
import keras
from airsim.utils import to_eularian_angles
from airsim.utils import to_quaternion
from keras.layers import Conv2D,Dense
from keras.layers import Activation
from keras.layers import MaxPool2D
from keras.layers import Dropout
from keras.layers import Input
import keras.backend as K
from keras.models import load_model
from keras import Input
from keras.layers import Flatten
from keras.activations import softmax,elu,relu
from keras.optimizers import Adam
from keras.optimizers import adam
from keras.models import Sequential
from keras.optimizers import Adam, RMSprop
from keras.models import Model
#tf.compat.v1.disable_eager_execution()
import random

from collections import deque

client=airsim.MultirotorClient()
z=-5
memory_size=10000000
#pos_0=client.getMultirotorState().kinematics_estimated.position

#state_space=[84, 84]
#action_size=3

def OurModel(state_size,action_space):

    X_input=Input(state_size,name='Input')
    X=Conv2D(filters=32,kernel_size=(8,8),strides=(4,4),padding='valid',activation='relu')(X_input)
    X=MaxPool2D(pool_size=(2,2))(X)
    X=Conv2D(filters=64,kernel_size=(4,4),strides=(2,2),padding='valid',activation='relu')(X)
    X=MaxPool2D(pool_size=(2,2))(X)
    X=Conv2D(filters=64,kernel_size=(1,1),strides=(1,1),padding='valid',activation='relu')(X)

    X=Flatten()(X)

    X=Dense(525,activation='relu')(X)
    X=Dense(300,activation='relu')(X)
    X_output=Dense(action_space,activation='softmax')(X)

    model=Model(inputs = X_input, outputs = X_output)
    model.compile(loss="mse", optimizer=RMSprop(lr=0.0005, rho=0.95, epsilon=0.01), metrics=["accuracy"])
    model.summary()

    return model

class MemoryClass():
    def __init__(self,memory_size):
        self.memory_size=memory_size
        self.buffer=deque(maxlen=memory_size)
        self.batch_size=64
        #self.start_training=20

    def add(self,experience):
        self.buffer.append(experience)

    def sample(self):
        buffer_size=len(self.buffer)
        idx=np.random.choice(np.arange(buffer_size),self.batch_size,False)
        return [self.buffer[k] for k in idx]

    def replay(self):
        batch=self.sample()
        next_states_mb=np.array([each[0] for each in batch],ndmin=3)
        actions_mb=np.array([each[1] for each in batch])
        states_mb=np.array([each[2] for each in batch],ndmin=3)
        rewards_mb=np.array([each[3] for each in batch])
        dones_mb=np.array([each[4] for each in batch])

        return next_states_mb, actions_mb, states_mb, rewards_mb,dones_mb

class Agent():
    def __init__(self):
        self.state_size=(84, 84,1)
        self.action_space=3
        #self.DQNNetwork=DQNN(state_size,action_space)
        self.model1=OurModel(self.state_size,self.action_space)
        self.memory_size=10000000
        self.memory=MemoryClass(memory_size)
        self.gamma=0.75
        self.epsilon_min=0.001
        self.epsilon=1.0
        self.epsilon_decay=0.995
        self.episodes=120
        self.max_step=120
        self.step=0
        self.count=0
        self.pos0=client.getMultirotorState().kinematics_estimated.position
        self.z=-5
        self.goal_pos=[50,50]
        self.initial_position=[0,0]
        self.initial_distance=np.sqrt((self.initial_position[0]-self.goal_pos[0])**2+(self.initial_position[1]-self.goal_pos[1])**2)
        self.batch_size=30

    def generate_state(self):
        responses = client.simGetImages([airsim.ImageRequest("0", airsim.ImageType.DepthPerspective, True, False)])
        # np.float was an alias for the builtin float and is gone from recent NumPy releases
        img1d = np.array(responses[0].image_data_float, dtype=float)
        #img1d = 255/np.maximum(np.ones(img1d.size), img1d)
        img2d = np.reshape(img1d, (responses[0].height, responses[0].width))

        from PIL import Image
        image = Image.fromarray(img2d)
        im_final = np.array(image.resize((84, 84)).convert('L'))
        im_final=np.reshape(im_final,[*self.state_size])
        return im_final

    def load(self, name):
        self.model1 = load_model(name)


    def save(self, name):
        self.model1.save(name)

    def get_yaw(self):
        quaternions=client.getMultirotorState().kinematics_estimated.orientation
        a,b,yaw_rad=to_eularian_angles(quaternions)
        yaw_deg=math.degrees(yaw_rad)
        return yaw_deg,yaw_rad

    def rotate_left(self):
        client.moveByRollPitchYawrateZAsync(0,0,0.2,self.z,3)
        n=int(3*5)
        D=[]
        done=False
        for k in range(n):
            collision=client.simGetCollisionInfo().has_collided
            done=collision
            D.append(collision)
            time.sleep(3/(n*300))
        if True in D:
            done=True
        time.sleep(3/300)
        time.sleep(5/300)
        new_state=self.generate_state()
        return done,new_state

    def rotate_right(self):
        client.moveByRollPitchYawrateZAsync(0,0,-0.2,self.z,3)
        n=int(3*5)
        D=[]
        done=False
        for k in range(n):
            collision=client.simGetCollisionInfo().has_collided
            done=collision
            D.append(collision)
            time.sleep(3/(n*300))
        if True in D:
            done=True
        time.sleep(3/300)
        time.sleep(5/300)
        new_state=self.generate_state()
        return done,new_state

    def move_forward(self):
        yaw_deg,yaw_rad=self.get_yaw()
        #need rad
        vx=math.cos(yaw_rad)*0.25
        vy=math.sin(yaw_rad)*0.25
        client.moveByVelocityAsync(vx,vy,0,10,airsim.DrivetrainType.ForwardOnly,airsim.YawMode(False))
        done=False
        n=int(10*5)
        D=[]
        done=False
        for k in range(n):
            collision=client.simGetCollisionInfo().has_collided
            D.append(collision)
            time.sleep(3.4/(n*300))
        if True in D:
            done=True
        new_state=self.generate_state()
        time.sleep(15/300)
        return done,new_state

    def step_function(self,action):
        # Returns action,new_state, done
        # Move forward 3 meters by Pitch
        done=False
        if action==0:
            done,new_state=self.move_forward()
        # Rotate to right by 20 degrees
        elif action==1:
            done,new_state=self.rotate_right()
        # Rotate to left by 30 degrees
        elif action==2:
            done,new_state=self.rotate_left()
        self.count+=1
        return action,new_state,done


    def compute_reward(self,done):
        reward=0.0
        pos_now=client.getMultirotorState().kinematics_estimated.position
        dist=np.sqrt((pos_now.x_val-self.goal_pos[0])**2+(pos_now.y_val-self.goal_pos[1])**2)
        print('dist: ',dist)

        if done==False and self.step<self.max_step:
            reward+=(self.initial_distance-dist)*6
            if 10<self.step<40 and dist>self.initial_distance*3/4:
                reward=-2-(self.step-10)
            elif 50<self.step<80 and dist>self.initial_distance*2/4:
                reward=-36-(self.step-50)
            elif 80<self.step<self.max_step and dist>self.initial_distance*1/4:
                reward=-80-(self.step-80)
            elif dist<3:
                reward+=650.0
        elif done==True and dist>3:
            reward-=180.0
        print('reward: ',reward)
        return reward


    def choose_action(self,state):
        r=np.random.rand()
        print('r: ',r)
        print('epsilon: ',self.epsilon)
        print()
        if r>self.epsilon and self.count>64:
            #print('predicted action')
            state=np.reshape(state,[1,*self.state_size])

            #action=np.argmax(self.DQNNetwork.OurModel.predict(state))
            action=np.argmax(self.model1.predict(state))
        else:
            action=random.randrange(self.action_space)
        return action


    def reset(self):
        client.reset()


    def initial_pos(self):
        client.enableApiControl(True)
        v=0.6
        #z0=client.getMultirotorState().kinematics_estimated.position.z_val
        #t=np.abs(z0-self.z)/v
        client.moveToZAsync(self.z,v).join()
        #time.sleep(t+1)


    def epsilon_policy(self):
        # Update epsilon
        if self.epsilon>self.epsilon_min:
            self.epsilon*=self.epsilon_decay


    def train(self):
        for episode in range(self.episodes):
            self.initial_pos()
            self.step=0
            state=self.generate_state()
            done=False
            total_reward,episode_rewards=[],[]
            while self.step<self.max_step:
                self.step+=1
                print('count:', self.count)
                choice=self.choose_action(state)
                self.epsilon_policy()
                action,next_state,done=self.step_function(choice)
                reward=self.compute_reward(done)
                episode_rewards.append(reward)
                if done==True:
                    total_reward.append(sum(episode_rewards))
                    self.memory.add([next_state,action,state,reward,done])
                    self.step=self.max_step
                    self.reset()
                    print("episode: {}, epsilon: {:.5}, total reward :{}".format(episode, self.epsilon,total_reward[-1]))
                    self.save("airsim-dqn.h5")
                else:
                    # store the transition before overwriting state, so state and next_state differ
                    self.memory.add([next_state,action,state,reward,done])
                    state=next_state
                if len(self.memory.buffer)>64:
                    next_states_mb, actions_mb, states_mb, rewards_mb,dones_mb=self.memory.replay()
                    target = self.model1.predict(states_mb)
                    target_next = self.model1.predict(next_states_mb)
                    for k in range(len(dones_mb)):
                        if dones_mb[k]==True:
                            target[k][actions_mb[k]] = rewards_mb[k]
                        elif dones_mb[k]==False:
                            target[k][actions_mb[k]] = rewards_mb[k] + self.gamma * (np.amax(target_next[k]))
                    self.model1.fit(x=states_mb,y=target,batch_size=self.batch_size)

agent=Agent()
agent.train()

AttributeError: module 'gym.envs.box2d' has no attribute 'LunarLander'

Thank you very much for this program. Here is a problem I ran into when I ran deep-reinforcement-learning/dqn/solution/Deep_Q_Network_Solution.ipynb on the Google Colab platform.

As you can see, I can't create the env successfully.

A possible plagiarism incident?

The code of this paper, https://github.com/WenhangBao/Multi-Agent-RL-for-Liquidation has an uncanny similarity to https://github.com/udacity/deep-reinforcement-learning/tree/master/finance. The paper has since been submitted to ICML 2019 proceeding.

Either the authors of the paper or Udacity must be the original author. Since https://github.com/udacity/deep-reinforcement-learning/tree/master/finance was committed earlier (~10 months ago), I have a tentative conclusion that the Udacity team was the original author of the code.

I am raising this issue because I have been doing research on a similar system, and we want to credit the right party.

tensorflow-gpu in requirements.txt?

I changed my local requirements.txt to include tensorflow-gpu==1.7.1 because "pip install ." tried to download/install non gpu tensorflow but tensorflow-gpu==1.7.1 was already installed.

Maybe the requirements.txt can change to include tensorflow-gpu==1.7.1 ?

README doesn't indicate how to contribute.

The README.md doesn't indicate how to contribute. I made a local branch to fix issue #34 but when I try to push the branch I get the following error, which indicates I don't have permission.

Please indicate in the README how to contribute.

Thanks.

Banana.exe Not working

When trying to complete the p1_navigation project, I am not able to open the environment Banana.exe from either Ubuntu 20.04 or Windows 10. When I try to run the code cell listed below,

env = UnityEnvironment(file_name="Banana_Windows_x86_64/Banana.exe")

The error that I get is the following:

timeout Traceback (most recent call last)
~\anaconda3\envs\udacity_rl\lib\site-packages\unityagents\environment.py in __init__(self, file_name, worker_id, base_port, curriculum)
98 self._socket.listen(1)
---> 99 self._conn, _ = self._socket.accept()
100 self._conn.settimeout(30)

~\anaconda3\envs\udacity_rl\lib\socket.py in accept(self)
292 """
--> 293 fd, addr = self._accept()
294 sock = socket(self.family, self.type, self.proto, fileno=fd)

timeout: timed out

During handling of the above exception, another exception occurred:

UnityTimeOutException Traceback (most recent call last)
in
----> 1 env = UnityEnvironment(file_name="Banana_Windows_x86_64/Banana.exe")

~\anaconda3\envs\udacity_rl\lib\site-packages\unityagents\environment.py in __init__(self, file_name, worker_id, base_port, curriculum)
102 p = json.loads(p)
103 except socket.timeout as e:
--> 104 raise UnityTimeOutException(
105 "The Unity environment took too long to respond. Make sure {} does not need user interaction to "
106 "launch and that the Academy and the external Brain(s) are attached to objects in the Scene."

UnityTimeOutException: The Unity environment took too long to respond. Make sure Banana_Windows_x86_64/Banana does not need user interaction to launch and that the Academy and the external Brain(s) are attached to objects in the Scene.

Please help. Thank you.

Discretization Issue when Creating Uniform Grid

In the Discretization Solution notebook, the sample [0.2, -1.9] should be mapped to grid cell [6, 3], as described just before In [8]. But the output of In [8] is [5, 3] instead. I debugged this and found that the issue is caused by the create_uniform_grid function. This notebook produces the expected result.

scaling output from tanh activation in the actor

def act(self, state, add_noise=True):
        """Returns actions for given state as per current policy."""
        state = torch.from_numpy(state).float().to(device)
        self.actor_local.eval()
        with torch.no_grad():
            action = self.actor_local(state).cpu().data.numpy()
        self.actor_local.train()
        if add_noise:
            action += self.noise.sample()
        return np.clip(action, -1, 1)

The above looks like it clips the action to the range -1 to 1, but Pendulum-v0 has an action range of -2 to 2, doesn't it? How does this work out?

REINFORCE Correction

Hello,

In deep-reinforcement-learning/reinforce/REINFORCE.ipynb

R is implemented as a single value in the following code:

discounts = [gamma**i for i in range(len(rewards)+1)]
R = sum([a*b for a,b in zip(discounts, rewards)])

However, it should be implemented as per-timestep discounted returns, as follows:

        discounted_rewards = []
        for t in range(len(rewards)):
          Gt = 0 
          pwr = 0
          for r in rewards[t:]:
              Gt = Gt + gamma**pwr * r
              pwr = pwr + 1
          discounted_rewards.append(Gt)
        
        policy_loss = []
        for log_prob, Gt in zip(saved_log_probs, discounted_rewards):
          policy_loss.append(-log_prob * Gt)

This important correction is compatible with the REINFORCE algorithm and leads to faster and more stable training, as shown in the figure.
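
The nested loop above recomputes every tail sum from scratch. The same per-timestep returns can be obtained in a single backward pass using the recursion G_t = r_t + gamma * G_{t+1}. This is a minimal sketch in plain Python; the function name rewards_to_go is hypothetical.

```python
def rewards_to_go(rewards, gamma):
    """Return [G_0, ..., G_{T-1}], where G_t = sum_k gamma**k * rewards[t + k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses the already computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(rewards_to_go([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```

Each entry can then be paired with the matching log-probability, as in the policy_loss loop above.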

Description of Reacher environment

I tried to collect the rewards for every timestep for a random agent and I found that most non-zero rewards are 0.04, I also got some 0.03, 0.02 and 0.01, whose counts are much less than 0.04. But the description says the reward for any timestep should be 0 or 0.1. Are there more details? Thanks!

OUNoise should use normal distribution

The current implementation uses random.random(), which I believe samples from a uniform distribution on [0, 1). This can hurt the exploration ability of the DDPG agent, since the noise will have a positive bias.
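
A minimal sketch of an Ornstein-Uhlenbeck process driven by zero-mean Gaussian increments, using only the standard library. The class name GaussianOUNoise and the default values for theta and sigma are assumptions chosen for illustration, not the repository's implementation.

```python
import random


class GaussianOUNoise:
    """Ornstein-Uhlenbeck process with standard normal increments."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process one step and return a copy of the new state."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


noise = GaussianOUNoise(size=2)
print(noise.sample())
```

Because gauss(0.0, 1.0) is symmetric around zero, the increments no longer push the state in the positive direction on average.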

ERROR: No matching distribution found for tensorflow==1.7.1 (from unityagents==0.4.0)

Hi guys,

I have a couple of questions:

Is the TensorFlow successful installation necessary for the package? Based on #1 it might not even be required, though the reply was 2 years ago.
In case it is required, would TF v2 do the trick? I remember that there's a bit of change in the API but have no idea what.
Why is the suggestion to use conda for the environment and then install packages via pip? Doesn't conda have all required dependencies?

Below is my bash log from trying to install this project's dependencies. I initially tried pip install, then, after some investigation, I took the steps below. My OS is Ubuntu 20.04.

(drlnd) kretyn@junk:~/courses/DeRL/deep-reinforcement-learning/python$ pip install --upgrade pip
Collecting pip
  Downloading pip-20.1-py2.py3-none-any.whl (1.5 MB)
     |████████████████████████████████| 1.5 MB 706 kB/s 
Installing collected packages: pip
Successfully installed pip-20.1
(drlnd) kretyn@junk:~/courses/DeRL/deep-reinforcement-learning/python$ python --version
Python 3.6.10 :: Anaconda, Inc.
(drlnd) kretyn@junk:~/courses/DeRL/deep-reinforcement-learning/python$ pip install .
Processing /home/kretyn/courses/DeRL/deep-reinforcement-learning/python
Requirement already satisfied: Pillow>=4.2.1 in /usr/lib/python3/dist-packages (from unityagents==0.4.0) (7.0.0)
Collecting docopt
  Using cached docopt-0.6.2.tar.gz (25 kB)
Collecting grpcio==1.11.0
  Using cached grpcio-1.11.0.tar.gz (14.2 MB)
Collecting ipykernel
  Using cached ipykernel-5.2.1-py3-none-any.whl (118 kB)
Collecting jupyter
  Using cached jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting matplotlib
  Using cached matplotlib-3.2.1-cp38-cp38-manylinux1_x86_64.whl (12.4 MB)
Requirement already satisfied: numpy>=1.11.0 in /usr/lib/python3/dist-packages (from unityagents==0.4.0) (1.17.4)
Collecting pandas
  Using cached pandas-1.0.3-cp38-cp38-manylinux1_x86_64.whl (10.0 MB)
Collecting protobuf==3.5.2
  Using cached protobuf-3.5.2-py2.py3-none-any.whl (388 kB)
Collecting pytest>=3.2.2
  Using cached pytest-5.4.1-py3-none-any.whl (246 kB)
Requirement already satisfied: pyyaml in /usr/lib/python3/dist-packages (from unityagents==0.4.0) (5.3.1)
Requirement already satisfied: scipy in /home/kretyn/.local/lib/python3.8/site-packages (from unityagents==0.4.0) (1.4.1)
ERROR: Could not find a version that satisfies the requirement tensorflow==1.7.1 (from unityagents==0.4.0) (from versions: 2.2.0rc1, 2.2.0rc2, 2.2.0rc3, 2.2.0rc4)
ERROR: No matching distribution found for tensorflow==1.7.1 (from unityagents==0.4.0)
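
The log above reports Python 3.6.10 for the python command, while pip resolves cp38 wheels and paths under python3.8, which suggests that the pip on the PATH belongs to a different interpreter than the conda environment. A small sketch for checking which interpreter is actually running follows; the function name interpreter_report is hypothetical.

```python
import sys
import sysconfig


def interpreter_report():
    """Collect the running interpreter's path, version and site-packages directory."""
    return {
        "executable": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "site_packages": sysconfig.get_paths()["purelib"],
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(f"{key}: {value}")
```

If the reported version is the environment's 3.6, invoking pip as python -m pip install . ties the installation to that same interpreter. Whether tensorflow==1.7.1 then resolves still depends on which wheels the package index offers for that platform.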

Unable to use deep reinforcement learning repository with latest MLAgents

Until now, I have only been able to use this repository with MLAgents version 0.4. How can I use it with the latest versions of MLAgents?

Can't setup environment due to outdated requirements

I have been trying to set this up on macOS 12.2.1 Monterey, but all of the software (unityagents, torch, etc.) is so old that it doesn't install. I used some more modern libraries, but it fails. I can run the Banana.app, but it times out with:

E0509 13:07:17.806027000 4446950912 fork_posix.cc:76]                  Other threads are currently calling into gRPC, skipping fork() handlers
Mono path[0] = '/Users/kevin/tmp/udacity/drl/udacity-deep-reinforcement-learning-master/p1_navigation/Banana.app/Contents/Resources/Data/Managed'
Mono config path = '/Users/kevin/tmp/udacity/drl/udacity-deep-reinforcement-learning-master/p1_navigation/Banana.app/Contents/MonoBleedingEdge/etc'

Is there any way you can update this to current versions of mlagents (unityagent replacement), torch, python, etc?

torch requirement outdated

The requirements.txt file includes torch==0.4.0
This throws an error as this version is not available any longer, also preventing the packages further down the list from being installed.

Banana environment throws a timeout on Windows64

This issue refers to the Navigation task here

This won't work on Windows64, as the environment throws a timeout error and fails to produce the required 'env' object.
Refer to the unresolved issues on the Udacity knowledge base here, here and here.

running environment on Nvidia RTX cards

I found a small issue with the instructions for setting up the Udacity environment on a local computer that has an Nvidia RTX card (in my case the RTX 2080 Ti). I needed to replace the pytorch 0.4 version (https://github.com/udacity/deep-reinforcement-learning/blob/master/python/requirements.txt#L11) with the latest pytorch (1.3.1), otherwise the environment will just not start.

DDPG - Bipedal is not converging

In this project we clearly see there is no learning happening : https://github.com/udacity/deep-reinforcement-learning/blob/master/ddpg-bipedal/DDPG.ipynb
This example should converge and solve the problem.

C:\ProgramData\Anaconda3\envs\drlnd\python.exe: No module named ipykernel

Hello everyone,
I was following the tutorial to install ML-Agents, but when executing the line:
python -m ipykernel install --user --name drlnd --display-name "drlnd"

I got this error:

C:\ProgramData\Anaconda3\envs\drlnd\python.exe: No module named ipykernel

I'm on Windows 10 and I've installed Anaconda 3 with Python 3.6.
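
The message means that the interpreter of the drlnd environment cannot find the ipykernel package. A minimal sketch for confirming this from inside that interpreter follows; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the running interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("ipykernel available:", is_installed("ipykernel"))
```

If this prints False inside the activated environment, installing the package there (for example with python -m pip install ipykernel) before rerunning the kernel registration command is the usual remedy.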

Installation on windows 10 failed

Hello,

I followed the instructions and tried to install from requirements.txt in the python folder, but unfortunately I got the following error message. Can you please help me resolve it?

Thanks,
Kim

(drlnd) deep-reinforcement-learning\python

pip install .

Collecting torch==0.4.0 (from unityagents==0.4.0)
Could not find a version that satisfies the requirement torch==0.4.0 (from unityagents==0.4.0) (from versions: 0.1.2, 0.1.2.post1)
No matching distribution found for torch==0.4.0 (from unityagents==0.4.0)