marlbenchmark / on-policy
This is the official implementation of Multi-Agent PPO (MAPPO).
Home Page: https://sites.google.com/view/mappo
License: MIT License
Hello, when I run rmappo with the default parameters in train_mpe.sh, I find that convergence on simple_spread becomes slower when I set n_rollout_threads=20 instead of the original 128. So I would like to ask the authors about the influence of n_rollout_threads on the algorithm's performance. In my opinion, this parameter determines the number of parallel envs and therefore the number of samples per PPO update. I am looking forward to hearing from you.
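As a rough illustration (assuming the default MPE episode_length of 25; the numbers are my own arithmetic, not from the authors), the amount of data collected per PPO update scales directly with this parameter:

episode_length = 25
for n_rollout_threads in (128, 20):
    print(n_rollout_threads, "threads ->", n_rollout_threads * episode_length, "env steps per update")
# 128 threads -> 3200 env steps per update; 20 threads -> only 500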
Hi,
Thanks for your work! But I am a bit confused about whether MAPPO is the same as the "parameter sharing framework" proposed in http://ala2017.it.nuigalway.ie/papers/ALA2017_Gupta.pdf. That work was published in 2017; perhaps your paper should cite it or compare against it.
thanks!
Hi @jxwuyi @eugenevinitsky @zoeyuchao @akashvelu
Thanks for your work!
Just a quick question: does turning use_centralized_V on or off only affect whether the input comes from local observations or from the centralized state? Does it affect the structure of the value network itself? From the code, what I can see is that no matter whether use_centralized_V is true or false, the input size is always (num_agents, input_dim) and the output values have size (num_agents, 1), so the network is not affected by use_centralized_V. And with centralized V the value outputs will be the same across num_agents, right?
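To make the question concrete, here is a minimal sketch of what I understand the flag to change (hypothetical shapes, my own illustration rather than the repo's code):

import numpy as np

num_agents, obs_dim = 3, 18
obs = np.random.randn(num_agents, obs_dim)               # per-agent local observations

use_centralized_V = True
if use_centralized_V:
    # each agent's critic input is the concatenation of all local observations
    share_obs = np.tile(obs.reshape(-1), (num_agents, 1))  # (num_agents, num_agents * obs_dim)
else:
    # each agent's critic input is only its own local observation
    share_obs = obs                                         # (num_agents, obs_dim)

# in either case the critic maps its input to one value per agent: (num_agents, 1)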
Look forward to your reply! Thank you!
Best,
Xubo
Hello,
Thanks for open-sourcing really good work. I was wondering if you could also open-source the MASAC code base, as it would help in understanding how MASAC differs from MADDPG and MAPPO. Thanks in advance for the help.
Hi, I have an issue when reproducing the performance of simple_spread in MPE.
The only modifications to your code are:
--use_wandb added in train_mpe.sh to disable wandb
self.envs.reset() added before line 26 in mpe_runner.py
Hi, dear author.
I find that each agent uses an independent V value function (agent-specific state) in the code. Why is it called use_centralized_V?
In my opinion, use_centralized_V should mean that all agents share the same V value function.
Thanks!
What is the difference between Hanabi-Full and Hanabi-very-small?
The function huber_loss in utils looks like:
def huber_loss(e, d):
    a = (abs(e) <= d).float()
    b = (e > d).float()
    return a * e ** 2 / 2 + b * d * (abs(e) - d / 2)
It may produce a zero loss when the error is greater than huber_delta.
If I'm not mistaken, it should be
b = (abs(e) > d).float()
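For completeness, a corrected version based on the fix above might look like this (my own sketch, not the repo's code):

import torch

def huber_loss_fixed(e, d):
    # both masks now test abs(e), so the quadratic and linear branches
    # partition all errors and the loss is never zero for abs(e) > d
    a = (abs(e) <= d).float()
    b = (abs(e) > d).float()
    return a * e ** 2 / 2 + b * d * (abs(e) - d / 2)

# quick check: a large negative error no longer yields zero loss
print(huber_loss_fixed(torch.tensor([-5.0]), 1.0))  # tensor([4.5000]) instead of tensor([0.])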
Looking forward to hearing from you.
Are the choices of IPPO hyperparameters the same as MAPPO shown in Table 12? The only difference is the value of "use_centralized_V" (False for IPPO, True for MAPPO), right? Thanks!
I have trained MAPPO on simple_spread with separated policies, but the results are not stable.
Could the authors help me figure out whether this is a problem with the hyperparameter settings?
Could you provide the values of "--use_popart", "--use_valuenorm", "--use_value_active_masks", "--use_policy_active_masks" across all SMAC maps to help better reproduce your results? Thanks a lot!
I just read the original paper that proposed "PopArt". Its main idea is to keep the normalization statistics from affecting the learning process by modifying the W and b of the last layer, but your implementation is just a z-transformation for value normalization, which actually hurts training in my experiments. Maybe you should follow the original PopArt paper?
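To illustrate what I mean, here is a minimal sketch of a PopArt-style output layer (my own illustration of the original paper's idea, not a claim about this repo's code): when the running mean/std of the targets change, the last layer's W and b are rescaled so the unnormalized predictions seen by the rest of the network are preserved, whereas a plain z-transformation only rescales the targets.

import torch
import torch.nn as nn

class PopArtHead(nn.Module):
    def __init__(self, in_dim, beta=1e-2, eps=1e-4):
        super().__init__()
        self.fc = nn.Linear(in_dim, 1)          # last layer of the value network
        self.beta, self.eps = beta, eps
        self.register_buffer("mean", torch.zeros(1))
        self.register_buffer("mean_sq", torch.ones(1))

    @property
    def std(self):
        return (self.mean_sq - self.mean ** 2).clamp(min=self.eps).sqrt()

    @torch.no_grad()
    def update(self, targets):
        old_mean, old_std = self.mean.clone(), self.std.clone()
        self.mean.mul_(1 - self.beta).add_(self.beta * targets.mean())
        self.mean_sq.mul_(1 - self.beta).add_(self.beta * (targets ** 2).mean())
        # the "preserve outputs" step: rewrite W, b so that
        # old_std * y_old + old_mean == new_std * y_new + new_mean for every input
        self.fc.weight.data.mul_(old_std / self.std)
        self.fc.bias.data.mul_(old_std).add_(old_mean - self.mean).div_(self.std)

    def forward(self, x):
        return self.fc(x)                       # prediction in normalized space

    def denormalize(self, y):
        return y * self.std + self.mean

A plain value normalizer would keep only the running statistics and the normalize/denormalize calls, without touching W and b; that is the difference I am pointing at.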
on-policy/onpolicy/envs/starcraft2/StarCraft2_Env.py
Lines 560 to 577 in 0483adc
It should be bad_transition = True in line 563.
Hi authors,
Really appreciate this nice codebase for MAPPO. One issue I found is that the StarCraft environment used here is somewhat different from the one in SMAC. For example, compare how this environment and the one in SMAC implement the global state function:
The get_state function in SMAC returns the true global state, while the get_state function in your StarCraft environment still returns an agent-dependent state. I am wondering why the two environments differ.
I read the MAPPO paper and learned interesting insights from it. I have a few questions regarding the paper (and its previous version).
I saw the performance of QMIX on Hanabi reported in a previous version of the paper, which was very low (0.29 on the small version). I'm curious why it performed so badly. The results reported in the current version with VDN are much better, despite QMIX being a superset of VDN. I understand that the current paper uses the full version of the game; still, its score almost reaches the optimum, while QMIX (on the small version) doesn't seem to learn a meaningful policy. Is it possible to get performance similar to VDN using QMIX?
From what I understand, in MPE, evaluation is entangled with training (the code alternates between training and evaluation phases). Is there any way (or a script) to evaluate only a trained agent and save gifs/videos, etc.?
Hi @jxwuyi @eugenevinitsky @zoeyuchao @akashvelu
Thanks a lot for your great work! It helps a lot in carrying the performance of on-policy DRL algorithms over from single-agent to multi-agent cases.
I found that some files for the separated runner on specific environments (e.g. SMAC or MPE) are missing in the folder "onpolicy/runner/seperated/xxx", but I noticed that a base_runner is implemented there. So I was wondering whether the separated runner (perhaps plus the separated buffer) is a ready-to-go implementation with PPO optimization. If so, I can directly use it with my customized environment and run experiments, similarly to how the shared runner is used.
Look forward to your reply! Thanks!
Best,
Xubo
Both during data collection and evaluation, when an episode terminates, the model just takes the last obs of the corresponding env as input. But I think the envs should be reset when they reach termination. Or did I miss something in the code?
Hi @jxwuyi @eugenevinitsky @zoeyuchao @akashvelu
Thanks a lot for your work!
I wonder if you have the code for a separated runner for the SMAC task? It would be very helpful!
Thanks!
Xubo
Why throw away the reward of step 0 when collecting samples for Hanabi Environment?
def collect(self, step):
    for current_agent_id in range(self.num_agents):
        ...
        # rearrange reward
        # reward of step 0 will be thrown away.
        self.turn_rewards[choose, current_agent_id] = self.turn_rewards_since_last_action[choose, current_agent_id].copy()
        self.turn_rewards_since_last_action[choose, current_agent_id] = 0.0
        self.turn_rewards_since_last_action[choose] += rewards[choose]
        ...
If you let QMIX use 8 parallel processes, increase the batch size and/or the number of training epochs per update, and finally use TD(lambda) <= 0.5,
QMIX can beat all of these algorithms.
See our brief hyperparameter study: https://arxiv.org/abs/2102.03479
I have really found that in the MARL field, hyperparameter-tuning issues have led to a pile of wrong conclusions and experiments, and even wrong motivations, involving 10+ CCF-A top-conference papers.
The AAAI conference in particular has accepted papers whose proofs are wrong.
The Python output is listed here:
Traceback (most recent call last):
File "train/train_football.py", line 203, in
main(sys.argv[1:])
File "train/train_football.py", line 188, in main
runner.run()
File "/onpolicy/runner/shared/football_runner.py", line 43, in run
self.insert(data)
File "/onpolicy/runner/shared/football_runner.py", line 141, in insert
masks=masks
TypeError: insert() got an unexpected keyword argument 'rnn_states'
MAPPO works well in my environment: the reward increases and the critic loss decreases to convergence. However, my actor loss increases until convergence; shouldn't it decrease? Could you please explain this? Thank you.
Could you confirm whether I understood it right, according to #1:
You return a death mask that is used only by the policy to calculate advantages, the policy loss, and the entropy, plus you change the global state for the agent. And that's all? I tried a similar thing with IPPO and it didn't work well.
Could you point me to the feature-pruned vs. default env global state in the code?
Thanks,
Denys
# Normal
class FixedNormal(torch.distributions.Normal):
    def log_probs(self, actions):
        return super().log_prob(actions).sum(-1, keepdim=True)

    def entrop(self):
        return super.entropy().sum(-1)

    def mode(self):
        return self.mean
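If the snippet above is meant to point out typos, a corrected version would presumably look like this (my own sketch; entrop renamed to entropy and super called as super()):

import torch

class FixedNormal(torch.distributions.Normal):
    def log_probs(self, actions):
        # sum log-probs over action dimensions, keeping a trailing dim of 1
        return super().log_prob(actions).sum(-1, keepdim=True)

    def entropy(self):
        # the original `super.entropy()` (without parentheses) would raise an AttributeError
        return super().entropy().sum(-1)

    def mode(self):
        return self.mean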
I hope to visualize the average_episode_rewards, but in this folder I can only see a file named "events.out.tfevents.xxx.amax". How can I open this file? Is there something I should change? I would appreciate it if you could tell me.
I also hope you can tell me how to open a .wandb file.
When reproducing the project, the last step ./train_mpe.sh gives an error when requesting link data from the project owner. How can I solve it?
Hi~ When I try to run the speaker_listener env, I find the base_runner.py file is missing. Could you provide this file?
Thanks!
Hello! I found that in the new version of the MAPPO paper, you use a concatenation of the default environment global state and all agents' local observations as the mixer network input. But why don't you instead concatenate the feature-pruned agent-specific global states used by MAPPO to build the mixer network input? Isn't this unfair for the comparison?
Hello!
I checked your paper and found in Section 4.5 that "death masking" simply uses an agent-specific constant vector, i.e., a zero vector with the agent's ID, as the input to the value function after an agent dies. However, I can't find the corresponding code in this project. You seem to apply the "active mask" to the entropy term, not to the state.
Am I missing something important in this project?
In the config.py file, there is this env parameter:
parser.add_argument("--use_obs_instead_of_state", action='store_true',
                    default=False, help="Whether to use global state or concatenated obs")
I would like to use a global state in my code. However, I don't understand how the aforementioned parameter is being used, since the base_runner is the only script that extracts it (and then doesn't use it anyway). Thanks!
Hi
Is it possible to get access to the 4 MAPPO-trained actor and critic models for Hanabi that you refer to in the paper?
Many thanks!
Hi
The code works on CPU but I get the problem below on GPU:
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Full details are below:
env is Hanabi, algo is mappo, exp is mlp_critic1e-3_entropy0.015_v0belief, max seed is 1
seed is 1:
choose to use gpu...
Traceback (most recent call last):
File "train/train_hanabi_forward.py", line 176, in <module>
main(sys.argv[1:])
File "train/train_hanabi_forward.py", line 161, in main
runner.run()
File "/nfs/home/ic/on-policy/onpolicy/runner/shared/hanabi_runner_forward.py", line 50, in run
self.collect(step)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
return func(*args, **kwargs)
File "/nfs/home/ic/on-policy/onpolicy/runner/shared/hanabi_runner_forward.py", line 153, in collect
self.use_available_actions[choose])
File "/nfs/home/ic/on-policy/onpolicy/algorithms/r_mappo/algorithm/rMAPPOPolicy.py", line 71, in get_actions
deterministic)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/nfs/home/ic/on-policy/onpolicy/algorithms/r_mappo/algorithm/r_actor_critic.py", line 62, in forward
actor_features = self.base(obs)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/nfs/home/ic/on-policy/onpolicy/algorithms/utils/mlp.py", line 54, in forward
x = self.mlp(x)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/nfs/home/ic/on-policy/onpolicy/algorithms/utils/mlp.py", line 25, in forward
x = self.fc1(x)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/functional.py", line 1610, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
training is done!
I get the error with both cuda 10.1 and 11.2. Do you have any thoughts on how to fix this? Many thanks!
I met the error: wandb: ERROR Error while calling W&B API: project not found (<Response [404]>)
I have changed the --user-name in config.py to mine, but it still doesn't work.
Thank you for your reply.
Hello,
Thanks for the code base; it is indeed really good work!
I was trying to replicate the results from the paper and started with the MPE environment with mappo and rmappo.
When I train rmappo on MPE, I get the following error:
env is MPE, scenario is simple_spread, algo is mappo, exp is check, max seed is 1
seed is 1:
choose to use gpu...
Traceback (most recent call last):
File "train/train_mpe.py", line 167, in <module>
main(sys.argv[1:])
File "train/train_mpe.py", line 152, in main
runner.run()
File "/home/kailash/Desktop/on-policy/onpolicy/runner/shared/mpe_runner.py", line 47, in run
self.save()
File "/home/kailash/Desktop/on-policy/onpolicy/runner/shared/base_runner.py", line 134, in save
policy_vnorm = self.trainer.policy.value_normalizer
AttributeError: 'R_MAPPOPolicy' object has no attribute 'value_normalizer'
When I train mappo on MPE, I get the following error:
env is MPE, scenario is simple_spread, algo is mappo, exp is check, max seed is 1
seed is 1:
Traceback (most recent call last):
File "train/train_mpe.py", line 167, in <module>
main(sys.argv[1:])
File "train/train_mpe.py", line 71, in main
assert (all_args.use_recurrent_policy == False and all_args.use_naive_recurrent_policy == False), ("check recurrent policy!")
AssertionError: check recurrent policy!
Please let me know if I made any mistake. Thanks for the help!
I have run your code:
./train_smac.sh
However, the following error occurred:
env is StarCraft2, map is corridor, algo is mappo, exp is mlp, max seed is 1
seed is 1:
Traceback (most recent call last):
File "train/train_smac.py", line 175, in <module>
main(sys.argv[1:])
File "train/train_smac.py", line 82, in main
"check recurrent policy!")
AssertionError: check recurrent policy!
I changed the argument "algo" in the file "train_smac.sh" to
algo="rmappo"
I don't know whether this modification is appropriate. It did run, but the results were not satisfactory.
Do you have good parameters for training?
This is a bug that I fixed by looking at the other on-policy code that you have. In shared/mpe_runner.py, in def eval(), almost at the end of the function:
eval_episode_rewards = np.array(eval_episode_rewards)
eval_env_infos = {}
eval_env_infos['eval_average_episode_rewards'] = np.sum(np.array(eval_episode_rewards), axis=0)
# print("eval average episode rewards of agent: " + str(eval_average_episode_rewards))
The variable eval_average_episode_rewards is not defined, and the code exits with an error. Instead I used:
print("eval average episode rewards of agent: " + str(np.mean(eval_env_infos['eval_average_episode_rewards'])))
This is different from separated/mpe_runner.py:
eval_train_infos = []
for agent_id in range(self.num_agents):
    eval_average_episode_rewards = np.mean(np.sum(eval_episode_rewards[:, :, agent_id], axis=0))
    eval_train_infos.append({'eval_average_episode_rewards': eval_average_episode_rewards})
    print("eval average episode rewards of agent%i: " % agent_id + str(eval_average_episode_rewards))
But I guess the logic is that in the shared case agent1 and agent2 are the same, so averaging the reward over their performance is reasonable.
def train(self):
    train_infos = []
    for agent_id in range(self.num_agents):
        self.trainer[agent_id].prep_training()
        train_info = self.trainer[agent_id].train(self.buffer[agent_id])
        train_infos.append(train_info)
        self.buffer[agent_id].after_update()
    return train_infos
Why is after_update() called after train()?
Hi,
Thanks for your contribution to the community. I directly ran the given example on MPE (simple_spread) with the default parameters, except for changing n_rollout_threads to 8 due to compute limits. However, the results on W&B only reach around -170, which suggests it is not working.
Has anybody met this problem? Could you help me fix it?
Thanks!
When running the train_mpe file, it prompts for W&B. I chose not to use visual results, but it then errors out. Does this program have to use W&B? I have always used TensorBoard and have never used W&B, so I don't know how to use it. If W&B is required, could you add a TensorBoard option? Thank you!
Nice paper and project! We are also doing research on this topic.
We find something confusing in the original MAPPO paper: why do the shaded regions of the following figure on page 7 go above a win rate of 1.0? What metric do you use in this figure? Could you give a more detailed explanation? Thank you!
Has anyone had a problem similar to mine? I ran this code on the Colab platform.
/content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts
env is MPE, scenario is simple_spread, algo is rmappo, exp is check, max seed is 1
seed is 1:
choose to use cpu...
wandb: Currently logged in as: yi-li (use wandb login --relogin to force relogin)
wandb: Tracking run with wandb version 0.12.11
wandb: Run data is saved locally in /content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts/results/MPE/simple_spread/rmappo/check/wandb/run-20220303_092843-1t4tb6nn
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run rmappo_check_seed1
wandb: ⭐️ View project at https://wandb.ai/yi-li/MPE
wandb: 🚀 View run at https://wandb.ai/yi-li/MPE/runs/1t4tb6nn
Exception in thread NetStatThr:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 149, in check_network_status
status_response = self._interface.communicate_network_status()
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 125, in communicate_network_status
resp = self._communicate_network_status(status)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 397, in _communicate_network_status
resp = self._communicate(req, local=True)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 222, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 227, in _communicate_async
raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
Exception in thread ChkStopThr:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 167, in check_status
status_response = self._interface.communicate_stop_status()
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 114, in communicate_stop_status
resp = self._communicate_stop_status(status)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 387, in _communicate_stop_status
resp = self._communicate(req, local=True)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 222, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 227, in _communicate_async
raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
and this is my debug.log:
2022-03-03 09:28:43,805 INFO MainThread:4971 [wandb_setup.py:_flush():75] Loading settings from /root/.config/wandb/settings
2022-03-03 09:28:43,806 INFO MainThread:4971 [wandb_setup.py:_flush():75] Loading settings from /content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts/wandb/settings
2022-03-03 09:28:43,806 INFO MainThread:4971 [wandb_setup.py:_flush():75] Loading settings from environment variables: {}
2022-03-03 09:28:43,806 INFO MainThread:4971 [wandb_setup.py:_flush():75] Inferring run settings from compute environment: {'program_relpath': 'train/train_mpe.py', 'program': 'train/train_mpe.py'}
2022-03-03 09:28:43,808 INFO MainThread:4971 [wandb_init.py:_log_setup():405] Logging user logs to /content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts/results/MPE/simple_spread/rmappo/check/wandb/run-20220303_092843-1t4tb6nn/logs/debug.log
2022-03-03 09:28:43,808 INFO MainThread:4971 [wandb_init.py:_log_setup():406] Logging internal logs to /content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts/results/MPE/simple_spread/rmappo/check/wandb/run-20220303_092843-1t4tb6nn/logs/debug-internal.log
2022-03-03 09:28:43,809 INFO MainThread:4971 [wandb_init.py:init():439] calling init triggers
2022-03-03 09:28:43,809 INFO MainThread:4971 [wandb_init.py:init():443] wandb.init called with sweep_config: {}
config: {'algorithm_name': 'rmappo', 'experiment_name': 'check', 'seed': 1, 'cuda': True, 'cuda_deterministic': True, 'n_training_threads': 1, 'n_rollout_threads': 128, 'n_eval_rollout_threads': 1, 'n_render_rollout_threads': 1, 'num_env_steps': 20000000, 'user_name': 'yi-li', 'use_wandb': True, 'env_name': 'MPE', 'use_obs_instead_of_state': False, 'episode_length': 25, 'share_policy': True, 'use_centralized_V': True, 'stacked_frames': 1, 'use_stacked_frames': False, 'hidden_size': 64, 'layer_N': 1, 'use_ReLU': False, 'use_popart': False, 'use_valuenorm': True, 'use_feature_normalization': True, 'use_orthogonal': True, 'gain': 0.01, 'use_naive_recurrent_policy': False, 'use_recurrent_policy': True, 'recurrent_N': 1, 'data_chunk_length': 10, 'lr': 0.0007, 'critic_lr': 0.0007, 'opti_eps': 1e-05, 'weight_decay': 0, 'ppo_epoch': 10, 'use_clipped_value_loss': True, 'clip_param': 0.2, 'num_mini_batch': 1, 'entropy_coef': 0.01, 'value_loss_coef': 1, 'use_max_grad_norm': True, 'max_grad_norm': 10.0, 'use_gae': True, 'gamma': 0.99, 'gae_lambda': 0.95, 'use_proper_time_limits': False, 'use_huber_loss': True, 'use_value_active_masks': True, 'use_policy_active_masks': True, 'huber_delta': 10.0, 'use_linear_lr_decay': False, 'save_interval': 1, 'log_interval': 5, 'use_eval': False, 'eval_interval': 25, 'eval_episodes': 32, 'save_gifs': False, 'use_render': False, 'render_episodes': 5, 'ifi': 0.1, 'model_dir': None, 'scenario_name': 'simple_spread', 'num_landmarks': 3, 'num_agents': 3}
2022-03-03 09:28:43,809 INFO MainThread:4971 [wandb_init.py:init():492] starting backend
2022-03-03 09:28:43,809 INFO MainThread:4971 [backend.py:_multiprocessing_setup():101] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2022-03-03 09:28:43,831 INFO MainThread:4971 [backend.py:ensure_launched():219] starting backend process...
2022-03-03 09:28:43,840 INFO MainThread:4971 [backend.py:ensure_launched():225] started backend process with pid: 4987
2022-03-03 09:28:43,851 INFO MainThread:4971 [wandb_init.py:init():501] backend started and connected
2022-03-03 09:28:43,867 INFO MainThread:4971 [wandb_init.py:init():565] updated telemetry
2022-03-03 09:28:43,873 INFO MainThread:4971 [wandb_init.py:init():596] communicating run to backend with 30 second timeout
2022-03-03 09:28:45,203 INFO MainThread:4971 [wandb_run.py:_on_init():1759] communicating current version
2022-03-03 09:28:45,250 INFO MainThread:4971 [wandb_run.py:_on_init():1763] got version response
2022-03-03 09:28:45,251 INFO MainThread:4971 [wandb_init.py:init():625] starting run threads in backend
2022-03-03 09:28:49,989 INFO MainThread:4971 [wandb_run.py:_console_start():1733] atexit reg
2022-03-03 09:28:49,993 INFO MainThread:4971 [wandb_run.py:_redirect():1606] redirect: SettingsConsole.REDIRECT
2022-03-03 09:28:49,994 INFO MainThread:4971 [wandb_run.py:_redirect():1611] Redirecting console.
2022-03-03 09:28:50,000 INFO MainThread:4971 [wandb_run.py:_redirect():1667] Redirects installed.
2022-03-03 09:28:50,001 INFO MainThread:4971 [wandb_init.py:init():664] run started, returning control to user process
Running train_mpe, the code reports an error at line 63 of the file "onpolicy/algorithms/utils/popart.py":
self.stddev = (self.mean_sq - self.mean ** 2).sqrt().clamp(min=1e-4)
self.weight = self.weight * old_stddev / self.stddev
self.bias = (old_stddev * self.bias + old_mean - self.mean) / self.stddev
File "/on-policy-main/onpolicy/algorithms/utils/popart.py", line 63, in update
self.stddev = (self.mean_sq - self.mean ** 2).sqrt().clamp(min=1e-4)
File "/home/cc/anaconda3/envs/MAPPO/lib/python3.6/site-packages/torch/nn/modules/module.py", line 613, in setattr
.format(torch.typename(value), name))
TypeError: cannot assign 'torch.FloatTensor' as parameter 'stddev' (torch.nn.Parameter or None expected)
Should the above code be changed to the following?
self.stddev = nn.Parameter((self.mean_sq - self.mean ** 2).sqrt().clamp(min=1e-4))
self.weight = nn.Parameter(self.weight * old_stddev / self.stddev)
self.bias = nn.Parameter((old_stddev * self.bias + old_mean - self.mean) / self.stddev)
In r_mappo.py, line 175
if self._use_popart:
    advantages = buffer.returns[:-1] - self.value_normalizer.denormalize(buffer.value_preds[:-1])
else:
    advantages = buffer.returns[:-1] - buffer.value_preds[:-1]
should be fixed to
if self._use_popart or self._use_valuenorm:
    advantages = buffer.returns[:-1] - self.value_normalizer.denormalize(buffer.value_preds[:-1])
else:
    advantages = buffer.returns[:-1] - buffer.value_preds[:-1]
This error collapses the performance of PPO.
Other notes:
The paper mentions the use of PopArt, but it is not actually used in the code.
Finally, MAPPO (centralized V) is mentioned in the abstract of the paper, but it is actually IPPO with global information, because the value function is not centralized.
Thanks.
Hello, I want to run the MPE speaker_listener experiment with 7 agents in MAPPO, but the program tells me only two agents are supported. Does this code not support speaker_listener experiments with 7 agents?