mishalaskin / curl

CURL: Contrastive Unsupervised Representation Learning for Sample-Efficient Reinforcement Learning

License: MIT License

Python 99.15% Shell 0.85%

contrastive-learning contrastive-loss contrastive-predictive-coding curl deep-learning deep-learning-algorithms deep-neural-networks deep-q-learning deep-q-network deep-reinforcement-learning deep-rl deeplearning deeplearning-ai gpu model-free-rl off-policy reinforcement-agents reinforcement-learning reinforcement-learning-algorithms sac

curl's Issues

About the equipment

Hi, thanks for sharing your code. I want to ask what is the configuration of the machine on which the code is running

Sorry that I cannot reproduce some DMC results in the paper

Dear CURL authors,

Thanks for such impactful work and for releasing the code!

Following the hyperparameters in Table 3 of the Implementation Details appendix, I ran each reported task with five seeds.
The results are:
Task (score at 500K steps) | Our results | CURL paper
Finger, Spin | 828 +/- 137 | 926 +/- 45
Cartpole, Swingup | 809 +/- 39 | 841 +/- 45
Reacher, Easy | 951 +/- 27 | 929 +/- 44
Cheetah, Run | 526 +/- 59 | 518 +/- 28
Walker, Walk | 892 +/- 49 | 902 +/- 43
Ball in Cup, Catch | 846 +/- 103 | 959 +/- 27

From the above results, we find that in some tasks (e.g., Finger Spin and Ball in Cup Catch) the mean score is lower than yours and the std is relatively high.

Besides, I find that the 100K and 500K results of Pixel SAC are almost the same in Table 1 of the paper.

Have you run into these issues with the current codebase? Thank you so much!

Is display necessary?

When I run the code, errors occur.

CRITICAL:absl:Shadow framebuffer is not complete, error 0x8cd7
CRITICAL:absl:Could not allocate display lists
CRITICAL:absl:Could not allocate display lists

Must I run on a machine with a display? Can the code be changed to run without one?
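
One possible workaround, offered as an assumption rather than a confirmed fix for this repository: dm_control picks its rendering backend from the `MUJOCO_GL` environment variable (`glfw`, `egl`, or `osmesa`), and the last two are meant for machines without a display. A minimal sketch of selecting a headless backend from Python:

```python
import os

# The backend is chosen when dm_control is imported, so set this first.
# "egl" needs a GPU driver with EGL support; "osmesa" is software rendering.
os.environ.setdefault("MUJOCO_GL", "egl")

print("Rendering backend requested:", os.environ["MUJOCO_GL"])
```

Whether this resolves the framebuffer errors above depends on the drivers installed on the machine.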

Questions about infinite bootstrap

Hi, thank you for your code. I'm a little confused about the infinite bootstrap in

curl/train.py

Line 269 in 8416d6e

done_bool = 0 if episode_step + 1 == env._max_episode_steps else float(

.
Will it be wrong when sampling at the end of an episode (where the next_obs is the start observation of the next episode)? It seems you simply ignore this.
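
For readers unfamiliar with the term, the quoted line appears to mask the `done` flag when an episode ends only because it reached the time limit, so the critic keeps bootstrapping from the next state. A minimal sketch of that idea follows; the function names, the discount value, and the toy numbers are assumptions for illustration, not the repository's code.

```python
def done_for_bootstrap(done: bool, episode_step: int, max_episode_steps: int) -> float:
    """Return 0.0 when the episode ended only because of the time limit."""
    if episode_step + 1 == max_episode_steps:
        return 0.0  # time-limit cutoff: treat the task as continuing
    return float(done)


def td_target(reward: float, next_value: float, done_bool: float, discount: float = 0.99) -> float:
    """One-step target; the next-state value is dropped only on a true terminal."""
    return reward + discount * (1.0 - done_bool) * next_value


# Last step of a 1000-step episode: the target still includes next_value.
timeout_target = td_target(1.0, 10.0, done_for_bootstrap(True, 999, 1000))
# A genuine termination in the middle of an episode: no bootstrapping.
terminal_target = td_target(1.0, 10.0, done_for_bootstrap(True, 10, 1000))
print(timeout_target, terminal_target)
```

The sketch does not address the question of which observation is stored as `next_obs` at the episode boundary.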

Generating the labels with torch.arange

First of all, thank you so much for kindly sharing your great research and also the code.

However, I have one question regarding the labels generation from the logits using the following code (curl_sac.py line 424):

labels = torch.arange(logits.shape[0]).long().to(self.device)

What if, for example, the batch sampled from the replay buffer contains several identical observations? Won't the code then treat identical features as different classes, since we use torch.arange?

Please correct me if I am wrong. Thank you so much.
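
To make the concern concrete, here is a minimal NumPy sketch (not the repository's code) of bilinear logits with `arange` labels. The names `z_a`, `z_pos`, and `W` mirror the snippets on this page; the toy batch, the identity `W`, and the helper `row_cross_entropy` are assumptions for illustration. If two rows of the batch are the same observation and receive the same features, the duplicate shows up as an equally scored negative, so the loss for that row cannot fall below log 2.

```python
import numpy as np


def row_cross_entropy(logits, labels):
    """Per-row cross-entropy with integer class labels (hypothetical helper)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]


rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
feats[2] = feats[0]                  # rows 0 and 2 are the same observation

z_a, z_pos = feats, feats            # assumption: anchor and positive share features
W = np.eye(8)                        # assumption: identity bilinear matrix

logits = z_a @ W @ z_pos.T           # shape [B, B]
labels = np.arange(logits.shape[0])  # row i is told that column i is its positive
losses = row_cross_entropy(logits, labels)

# Column 2 scores as high as column 0 for row 0, yet it is labeled a negative.
print(np.isclose(logits[0, 0], logits[0, 2]), losses[0], np.log(2))
```

In practice the two views are cropped independently, so exact ties like this would need identical observations and identical augmentations.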

Do you know why I might be getting this error when I run train.py?

Edit: Never mind, had missing dependencies.

FileNotFoundError

Hello, I followed the README to run
bash scripts/run.sh

This is what I got:
FileNotFoundError: [Errno 2] No such file or directory: './tmp/cartpole/cartpole-swingup-05-10-im84-b128-s482469-pixel/args.json'

Environment step count with frame-skip

Great work and thanks a lot for releasing the code! It’s awesome to see this simple contrastive loss term performing so well without the need for reconstruction.

Quick question regarding the environment step count: if we consider a DMC episode of standard length 1000 steps and we use a frameskip of 4, do the reported results consider the episode to have 1000 steps or 250 steps? Put differently, do the 100k step results mean 100k “low-level DMC” steps or 100k “agent-applying-an-action” steps?
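
The two readings in the question differ by a factor equal to the frame skip. The sketch below only spells out that arithmetic with hypothetical helper names; it does not settle which convention the paper uses.

```python
def to_agent_steps(low_level_steps: int, action_repeat: int) -> int:
    """Agent decisions needed to cover a number of low-level simulator steps."""
    return low_level_steps // action_repeat


def to_low_level_steps(agent_decisions: int, action_repeat: int) -> int:
    """Low-level simulator steps consumed by a number of agent decisions."""
    return agent_decisions * action_repeat


# One 1000-step DMC episode with a frame skip of 4.
print(to_agent_steps(1000, 4))         # 250
# "100k steps" read as low-level steps versus as agent decisions.
print(to_agent_steps(100_000, 4))      # 25000
print(to_low_level_steps(100_000, 4))  # 400000
```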

when bash scripts/run.sh

FileNotFoundError: [Errno 2] No such file or directory: './tmp/cartpole/cartpole-swingup-06-22-im84-b128-s202969-pixel/args.json'

What is "ema"?

In CURL.encode, what is the arg "ema"?

Memory Usage Increasing Continuously

Hi! Great paper btw!
When I run the code the RAM usage continuously increases. I am running the default code with no changes. The memory consumption keeps on increasing and after 8k iterations, the OS kills the process. My PC specs: Intel i7 processor, Nvidia RTX2070 GPU, 16GB RAM. Can you please help me out? Thank you.

Rainbow DQN + CURL

Will there be scripts for discrete/Atari environments?

A bug: Segmentation fault (core dumped)

Hi, thank you for sharing your code.
When I run run.sh on an Ubuntu server, I get an error:

warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
CRITICAL:absl:OpenGL version 1.5 or higher required
CRITICAL:absl:OpenGL ARB_framebuffer_object required
CRITICAL:absl:OpenGL ARB_vertex_buffer_object required
./scripts/run.sh: line 10: 4276 Segmentation fault (core dumped)

But when I run it on my own computer, the error doesn't appear. I can't fix it on the server, so I'd appreciate any help with this error.

Optimising encoder twice during CURL?

Thanks for sharing your code, it's great to be able to go through the implementation.

Maybe I'm misunderstanding this, but it seems that if you intend self.cpc_optimizer to only optimise W, then

self.cpc_optimizer = torch.optim.Adam(
    self.CURL.parameters(), lr=encoder_lr
)

should be either

self.cpc_optimizer = torch.optim.Adam(
    self.CURL.parameters(recurse=False), lr=encoder_lr
)

or

self.cpc_optimizer = torch.optim.Adam(
    [self.CURL.W], lr=encoder_lr
)

The code I'm referring to is here and the torch docs for parameter are here. And I'm comparing it to section 4.7 of your paper.

As it stands, it seems that the encoder is optimised twice, once in encoder_optimizer and again in cpc_optimizer.

Or am I missing something?

Cannot train for Google football environment.

I've been trying to use CURL with an environment other than the DeepMind Control Suite, namely the Google football environment, but I've been getting errors regarding action_shape, obs_shape and channels.

1) Issue with channels:

RuntimeError: Given groups=1, weight of size 32 6 3 3, expected input[1, 144, 84, 3] to have 6 channels, but got 144 channels instead.

2) Issue while assigning value of action shape from that of the environments:

Traceback (most recent call last):
File "train.py", line 291, in <module>
main()
File "train.py", line 226, in main
device=device
File "train.py", line 148, in make_agent
curl_latent_dim=args.curl_latent_dim
File "/home/atharva/CURL/curl/curl_sac.py", line 285, in __init__
num_layers, num_filters
File "/home/atharva/CURL/curl/curl_sac.py", line 73, in __init__
nn.Linear(hidden_dim, 2 * action_shape[0])
IndexError: tuple index out of range

3) Issue with PixelEncoder :

Traceback (most recent call last):
File "train.py", line 292, in <module>
main()
File "train.py", line 240, in main
evaluate(env, agent, video, args.num_eval_episodes, L, step,args)
File "train.py", line 116, in evaluate
run_eval_loop(sample_stochastically=False)
File "train.py", line 101, in run_eval_loop
action = agent.select_action(obs)
File "/home/atharva/CURL/curl/curl_sac.py", line 355, in select_action
obs, compute_pi=False, compute_log_pi=False
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/atharva/CURL/curl/curl_sac.py", line 82, in forward
obs = self.encoder(obs, detach=detach_encoder)
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/atharva/CURL/curl/encoder.py", line 67, in forward
h_fc = self.fc(h)
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/torch/nn/functional.py", line 1370, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [1 x 672], m2: [39200 x 50] at /tmp/pip-req-build-ocx5vxk7/aten/src/THC/generic/THCTensorMathBlas.cu:290

4) Issue with Padded input :

Traceback (most recent call last):
File "train.py", line 292, in <module>
main()
File "train.py", line 240, in main
evaluate(env, agent, video, args.num_eval_episodes, L, step,args)
File "train.py", line 116, in evaluate
run_eval_loop(sample_stochastically=False)
File "train.py", line 101, in run_eval_loop
action = agent.select_action(obs)
File "/home/atharva/CURL/curl/curl_sac.py", line 355, in select_action
obs, compute_pi=False, compute_log_pi=False
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/atharva/CURL/curl/curl_sac.py", line 82, in forward
obs = self.encoder(obs, detach=detach_encoder)
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/atharva/CURL/curl/encoder.py", line 62, in forward
h = self.forward_conv(obs)
File "/home/atharva/CURL/curl/encoder.py", line 55, in forward_conv
conv = torch.relu(self.convs[i](conv))
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 345, in forward
return self.conv2d_forward(input, self.weight)
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (21 x 1). Kernel size: (3 x 3). Kernel size can't be greater than actual input size.

5) Issue with Action in action set :

Traceback (most recent call last):
File "train.py", line 292, in <module>
main()
File "train.py", line 240, in main
evaluate(env, agent, video, args.num_eval_episodes, L, step,args)
File "train.py", line 116, in evaluate
run_eval_loop(sample_stochastically=False)
File "train.py", line 102, in run_eval_loop
obs, reward, done, _ = env.step(action)
File "/home/atharva/CURL/curl/utils.py", line 226, in step
obs, reward, done, info = self.env.step(action)
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/gym/core.py", line 234, in step
return self.env.step(action)
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/gym/core.py", line 280, in step
observation, reward, done, info = self.env.step(action)
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/gym/core.py", line 268, in step
observation, reward, done, info = self.env.step(action)
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/gym/core.py", line 268, in step
observation, reward, done, info = self.env.step(action)
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/gfootball/env/football_env.py", line 177, in step
_, reward, done, info = self._env.step(self.get_actions())
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/gfootball/env/football_env_core.py", line 160, in step
for a in action
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/gfootball/env/football_env_core.py", line 160, in
for a in action
File "/home/atharva/anaconda3/envs/curl/lib/python3.6/site-packages/gfootball/env/football_action_set.py", line 217, in named_action_from_action_set
assert False, "Action {} not found in action set".format(action)
AssertionError: Action -0.049828674644231796 not found in action set

It just seems that trying to solve one issue gives rise to another. Can you please let me know how these issues could be resolved?

Thank you.
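
On item 1, the reported input shape [1, 144, 84, 3] looks channels-last, whereas a PyTorch convolution expects channels-first input. Below is a minimal NumPy sketch of that conversion, assuming the environment returns an H x W x C frame; the 144 x 84 x 3 shape is taken from the error message and the helper name is hypothetical.

```python
import numpy as np


def to_channels_first(frame: np.ndarray) -> np.ndarray:
    """Convert an H x W x C image to C x H x W."""
    return np.transpose(frame, (2, 0, 1))


frame = np.zeros((144, 84, 3), dtype=np.uint8)  # stand-in for one observation
chw = to_channels_first(frame)
batch = chw[None]                               # add a leading batch dimension
print(batch.shape)                              # (1, 3, 144, 84)
```

The channel count must still match what the encoder was built with (6 in the error message), and item 5 suggests the environment expects discrete actions while the agent emits continuous ones, so this conversion alone is unlikely to be sufficient.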

How to use this model with different environment

An error when computing CURL

I noticed when reading through the paper and the code that your pseudocode in the paper says that the key encoder needs to be detached from the graph but in your actual code you don't set detach = True for z_pos = self.CURL.encode(obs_pos, ema=True). I wanted to know whether the paper or code is correct. Or maybe I am missing some part of the computation.

This is what is in the code for curl_sac.py:

def update_cpc(self, obs_anchor, obs_pos, cpc_kwargs, L, step):
        
        z_a = self.CURL.encode(obs_anchor) 
        z_pos = self.CURL.encode(obs_pos, ema=True)
        
        logits = self.CURL.compute_logits(z_a, z_pos)
        labels = torch.arange(logits.shape[0]).long().to(self.device)
        loss = self.cross_entropy_loss(logits, labels)
        
        self.encoder_optimizer.zero_grad()
        self.cpc_optimizer.zero_grad()
        loss.backward()

        self.encoder_optimizer.step()
        self.cpc_optimizer.step()
        if step % self.log_interval == 0:
            L.log('train/curl_loss', loss, step)

and this is what is in the pseudocode for the paper:

for x in loader: 
    x_q = aug(x)
    x_k = aug(x)
    z_q = f_q.forward(x_q)
    z_k = f_k.forward(x_k)
    z_k = z_k.detach()
    proj_k = matmul(W, z_k.T)
    logits = matmul(z_q, proj_k)
    logits = logits - max(logits, axis=1)
    labels = arange(logits.shape[0])
    loss = CrossEntropyLoss(logits, labels)
    loss.backward()
    update(f_q.params)
    update(W)
    f_k.params = m*f_k.params+(1-m)*f_q.params

retain_graph=True

Thanks for your great work. I ran into a problem when modifying the code: the error message said that I must use retain_graph=True. Where might I have gone wrong?

I know the issue will get no reply, but where is the MoCo in the code?

I understand that the response might be delayed, but I'm having difficulty locating the MoCo implementation in the CURL codebase. Could you kindly point me to the relevant section or file where MoCo is implemented? Thank you for your assistance.

Using cross-entropy loss does not penalize negative samples

Hi,

I see that you use the cross-entropy (CE) loss for the contrastive learning. As far as I understand, this does not penalize the negative samples, as the CE loss gives zero weight to the non-diagonal entries in the [B, B] matrix. Am I mistaken?

Best,
Sherwin
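
One way to check this is to look at the gradient of softmax cross-entropy with respect to the logits, which is the softmax output minus the one-hot label matrix. The NumPy sketch below uses a random toy [B, B] matrix (an assumption for illustration, not data from the repository) and shows that the off-diagonal entries receive nonzero gradient through the softmax normalizer.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max subtraction for stability."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


B = 4
rng = np.random.default_rng(0)
logits = rng.normal(size=(B, B))   # toy [B, B] similarity matrix
labels = np.arange(B)              # positives sit on the diagonal

# Gradient of the summed cross-entropy with respect to each logit.
grad = softmax(logits) - np.eye(B)[labels]

off_diag = grad[~np.eye(B, dtype=bool)]
print(np.diag(grad))   # negative: gradient descent raises the positive logits
print(off_diag)        # positive: gradient descent lowers the negative logits
```

So although only the diagonal entry appears in the numerator of the loss, the negatives are still pushed down because they appear in the denominator.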

Momentum update is never used in the code

Hi, I was going through the code and couldn't find where the momentum encoder is updated. I think it is initialized only once at the beginning and then isn't trained at all.
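
For reference, the update being looked for is the rule from the paper's pseudocode quoted elsewhere on this page, `f_k.params = m*f_k.params+(1-m)*f_q.params`. A minimal NumPy sketch of that rule follows; the function name, the list-of-arrays parameter layout, and the value of `m` are assumptions for illustration, and the sketch says nothing about where the repository performs this step.

```python
import numpy as np


def momentum_update(key_params, query_params, m=0.95):
    """In place: key <- m * key + (1 - m) * query, for each parameter array."""
    for k, q in zip(key_params, query_params):
        k *= m
        k += (1.0 - m) * q


query = [np.ones((2, 2)), np.ones(2)]   # stand-in for the query encoder weights
key = [np.zeros((2, 2)), np.zeros(2)]   # stand-in for the key encoder weights

momentum_update(key, query, m=0.9)
print(key[0])   # each entry moves 10% of the way toward the query weights
```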

A bug when cropping images?

Hi, thank you for your great research!
I think there may be a bug in the random_crop function in utils.py:

curl/utils.py

Lines 244 to 245 in 23b0880

 w1 = np.random.randint(0, crop_max, n) 

 h1 = np.random.randint(0, crop_max, n)

I think crop_max should be changed to crop_max + 1.
As it stands, the bottommost rows and rightmost columns of the image are never included in cropped_image. Also, an error occurs when output_size == img_size.
I'm sorry if I'm wrong :)
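
The report rests on the fact that the upper bound of `np.random.randint` is exclusive. The sketch below compares the two bounds; it assumes `crop_max = img_size - output_size`, uses a hypothetical helper name, and picks 100 and 84 as example sizes, none of which is confirmed by the quoted lines.

```python
import numpy as np


def crop_offsets(img_size, output_size, n, inclusive=True):
    """Sample n top-left crop offsets along one image axis."""
    crop_max = img_size - output_size   # assumption: largest valid offset
    high = crop_max + 1 if inclusive else crop_max
    return np.random.randint(0, high, n)   # `high` itself is never sampled


np.random.seed(0)
print(crop_offsets(100, 84, 10_000, inclusive=False).max())  # never exceeds 15
print(crop_offsets(100, 84, 10_000, inclusive=True).max())   # can reach 16

# With equal sizes the only valid offset is 0; the exclusive bound leaves an empty range.
print(crop_offsets(84, 84, 3, inclusive=True))
try:
    crop_offsets(84, 84, 3, inclusive=False)
except ValueError as err:
    print("exclusive bound fails:", err)
```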

encoder updated twice for curl loss

I'm looking into the code and found that in update_cpc() both self.encoder_optimizer.step() and self.cpc_optimizer.step() are called. However, the parameters of critic.encoder are held by both optimizers. Doesn't that mean that, in update_cpc(), critic.encoder is updated twice using the same gradient?

Can we integrate the update_critic function and update_cpc function together?

Hi, can we integrate the update_critic function and the update_cpc function by adding critic_loss and cpc_loss together?
That way, we would only need two optimizers.
Is this feasible?

self.cpc_optimizer = torch.optim.Adam([self.CURL.W], lr=encoder_lr)
self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=critic_lr, betas=(critic_beta, 0.999))
loss = critic_loss + cpc_loss
loss.backward()
self.critic_optimizer.step()
self.cpc_optimizer.step()

WARN precision

I got WARN: Box bound precision lowered by casting to float32 when I run the code. Should this be fixed?

Trouble reproducing results. Help much appreciated~

Hello! Thank you so much for putting up this valuable resource! I was wondering if I may ask for some kind advice about replicating the results, which I have been unable to do.

Mainly, I have been testing CURL (using the default settings + command listed on https://github.com/MishaLaskin/curl) against CURL with the following lines commented out (which should give me pixel SAC):

    # if step % self.cpc_update_freq == 0 and self.encoder_type == 'pixel':
    #     obs_anchor, obs_pos = cpc_kwargs["obs_anchor"], cpc_kwargs["obs_pos"]
    #     self.update_cpc(obs_anchor, obs_pos,cpc_kwargs, L, step,0)

For [cartpole, swingup], I obtained ~850 for CURL, but strangely I also obtained ~850 (and very quickly too) for pixel SAC. This lack of difference was replicated over 5 seeds and was very robust. Is my code change correct, or have I modified the code in the wrong way?

For the task [finger,spin] I obtained ~ 350 for both CURL and pixel SAC, also no difference.

Thank you in advance for the kind help! :)
