ucla-rlcourse / rlexample Goto Github PK

Some basic examples of playing with RL

Python 85.42% TeX 14.58%

rlexample's Introduction

Some basic examples for reinforcement learning

Installing Anaconda and Gymnasium

Download and install Anaconda here
Create conda env for managing dependencies and activate the conda env

conda create -n conda_env
conda activate conda_env

Install gymnasium (Dependencies installed by pip will also go to the conda env)

pip install gymnasium[all]
pip install gymnasium[accept-rom-license]

# Try the next line if box2d-py fails to install.
conda install swig

Install ai2thor if you want to run navigation_agent.py

pip install ai2thor==2.4.10

Install torch with either conda or pip

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

pip install torch torchvision torchaudio

Install other dependencies

pip install numpy pandas matplotlib

Examples

Play with the environment and visualize the agent behaviour

import gymnasium as gym
render = True # switch if visualize the agent
if render:
    env = gym.make('CartPole-v0', render_mode='human')
else:
    env = gym.make('CartPole-v0')
env.reset(seed=0)
for _ in range(1000):
    env.step(env.action_space.sample()) # take a random action
env.close()

Random play with CartPole-v0

import gymnasium as gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        print(observation)
        action = env.action_space.sample()
        observation, reward, terminated, truncated, info = env.step(action)
        done = np.logical_or(terminated, truncated)
env.close()

Example code for random playing (Pong-ram-v0,Acrobot-v1,Breakout-v0)

python my_random_agent.py Pong-ram-v0

Very naive learnable agent playing CartPole-v0 or Acrobot-v1

python my_learning_agent.py CartPole-v0

Playing Pong on CPU (with a great blog). One pretrained model is pong_model_bolei.p(after training 20,000 episodes), which you can load in by replacing save_file in the script.

python pg-pong.py

Random navigation agent in AI2THOR

python navigation_agent.py

rlexample's People

Contributors

Stargazers

Watchers

Forkers

dull-bird weilongzheng chuangchuangtan zfxu jwgu yucoian jiaodaxiaozi sjtuqizhenlin dingrong123 lulusindazc eliver8801 akirannz zhangruiskyline mvisionai zonghanli northocean zhangyunpu stoneshuyao mingming741 jackiexuw lesliezhang1314 janetwise ffeng1996 pengzhenghao gp-chen zhaoyinfu123 marmotatzju zhangnanxi vincentls striderw zoeson 18292753687 superwind hijaen tracy173 chenkehan21 liuduo2019102962 alanwangxiao keruhua zhb-xjtu tiyife smileofheart heynin nudtxiong cdyangbo uestcwangxiao taoboq harryjd rnjia staminatang luo-li wangdongdo magi803 zeng0119 zhengjinquan renhongquan lrjxaint ziyiliubird tornadotime coderimoe jennysrh bullud zyt755 flypark666 wuhanllliu sadudu jackymail hsiachubby edwardhk sizhewei jackyin68 bibibibibit guoea liuxiang2020 wanglijia123 leon-liangwu scorpio-h qiu1234567 liudicsu garyandtang limmeryang gusting alan963852741 chocomoon chaoshunh chnxindong atalanta0505 jianpei-w szcf-weiya wee-flame sustechbruce xzycr7 zhangzheming33 gaopenggang motor2kevin cohenqu ln2426557273 jionie guilinf zhaoju37

rlexample's Issues

The small problem in value_iteration code

In line 79, sum([p*(r + prev_v[s_]) lack the gamma (the gamma=1.0 is not affected the result). The right code is sum([p*(r + gamma*prev_v[s_]) in line 70.
Thanks.

BEETLE Algorithm

Is there an implementation of BEETLE algorithm from paper "An Analytic Solution to Discrete bayesian RL"? Thanks!

Jupyter Notebook Vs Anaconda's Spyder

I used to code in Spyder which is a much better IDE. Could I use Spyder instead of Jupyter?

Question about the policy evaluation function and policy extraction of the MDP code

Hi, thanks for your course! It helps me a lot.

I have some question about the code in frozenlake_policy_iteration.py. Why is the expression of the value fuction in compute_policy_v (line 52) same as the state-action function in compute_policy_v (line37) ?

And why is the expression of the value function v[s] = sum([p * (r + gamma * prev_v[s_]) for p, s_, r, _ in env.env.P[s][policy_a]]) different from the formula(17) in the slide? It seems that the expression in the code ignore the transition probability P(s'|s,a)?

Thanks! Look forward your reply~

Problems with code in pgb-pong-pytorch and pg-pong-pytorch

Hi, I assumed there are some errors with the above two algorithms codes. Basically, they are similar.

In both of them, Professor used "args.batch_size" to update model params every batch_size episodes, this corresponds to what was presented in professor's lecture slide 5. But in the defined function: finish_episode(), G is calculated for every single episode, I guess you might forget to separate rewards in each episode since you also commented in the ac-pong codes and flatten rewards and values you defined for calculation.

If the model is updated every batch_size time, then policy.rewards should append a [] for every episode separately. Hope my understanding is correct.

No module named 'gym.envs.atari'

Policy loss computes gradients for value network

This issue exists in ac-pong-pytorch.py and pgb-pong-pytorch.

I am investigating this problem. It is the possible cause to the training failure.

policy iteration doesn't work for deterministic frozen_lake env

I am trying to use the code as an example. Well, it is a little bit strange, when I changed the frozen lake env to the deterministic version, e.g. env = gym.make("FrozenLake=v0, is_slippery=False), I found the policy iteration algorithm can't work correctly. I checked the code, it seems nothing wrong. One of the reason might be the insufficient exploration of the agent, however, the env is simple enough and the default iteration numbers are set to 200000. But the problem still can't be solved.