marlbenchmark / on-policy
This is the official implementation of Multi-Agent PPO (MAPPO).
Home Page: https://sites.google.com/view/mappo
License: MIT License
Hello, when I run rmappo with the default parameters in train_mpe.sh, I find that convergence on simple_spread becomes slower when I set n_rollout_threads=20 instead of the original 128. So I would like to ask the authors about the influence of n_rollout_threads on the algorithm's performance. In my opinion, this parameter determines the number of parallel envs and therefore the number of samples per PPO update. I am looking forward to hearing from you.
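As a rough illustration (assuming the default MPE episode_length of 25; the numbers are my own arithmetic, not from the authors), the amount of data collected per PPO update scales directly with this parameter:

episode_length = 25
for n_rollout_threads in (128, 20):
    print(n_rollout_threads, "threads ->", n_rollout_threads * episode_length, "env steps per update")
# 128 threads -> 3200 env steps per update; 20 threads -> only 500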
Hi,
Thanks for your work! But I am a bit confused about whether MAPPO is the same as the "parameter sharing framework" proposed in http://ala2017.it.nuigalway.ie/papers/ALA2017_Gupta.pdf. That work was published in 2017; perhaps your paper should cite it or compare against it.
thanks!
Hi @jxwuyi @eugenevinitsky @zoeyuchao @akashvelu
Thanks for your work!
Just a quick question: does turning use_centralized_V on or off only affect whether the input comes from local observations or from the centralized state? Does it affect the structure of the value network itself? From the code, what I can see is that no matter whether use_centralized_V is true or false, the input size is always (num_agents, input_dim) and the output values have size (num_agents, 1), so the network is not affected by use_centralized_V. And with centralized V the value outputs will be the same across num_agents, right?
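To make the question concrete, here is a minimal sketch of what I understand the flag to change (hypothetical shapes, my own illustration rather than the repo's code):

import numpy as np

num_agents, obs_dim = 3, 18
obs = np.random.randn(num_agents, obs_dim)               # per-agent local observations

use_centralized_V = True
if use_centralized_V:
    # each agent's critic input is the concatenation of all local observations
    share_obs = np.tile(obs.reshape(-1), (num_agents, 1))  # (num_agents, num_agents * obs_dim)
else:
    # each agent's critic input is only its own local observation
    share_obs = obs                                         # (num_agents, obs_dim)

# in either case the critic maps its input to one value per agent: (num_agents, 1)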
Look forward to your reply! Thank you!
Best,
Xubo
Hello,
Thanks for open-sourcing really good work. I was wondering if you could also open-source the MASAC code base, as it would help in understanding how MASAC differs from MADDPG and MAPPO. Thanks in advance for the help.
Hi, I have an issue when reproducing the performance of simple_spread in MPE.
The only modifications to your code are:
--use_wandb added in train_mpe.sh to disable wandb
self.envs.reset() added before line 26 in mpe_runner.py
Hi, dear author.
I find that each agent uses an independent V value function (agent-specific state) in the code. Why is it called use_centralized_V?
In my opinion, use_centralized_V should mean that all agents share the same V value function.
Thanks!
What is the difference between Hanabi-Full and Hanabi-very-small?
The function huber_loss in utils looks like:
def huber_loss(e, d):
    a = (abs(e) <= d).float()
    b = (e > d).float()
    return a * e ** 2 / 2 + b * d * (abs(e) - d / 2)
It may produce a zero loss when the error is greater than huber_delta.
If I'm not mistaken, it should be
b = (abs(e) > d).float()
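For completeness, a corrected version based on the fix above might look like this (my own sketch, not the repo's code):

import torch

def huber_loss_fixed(e, d):
    # both masks now test abs(e), so the quadratic and linear branches
    # partition all errors and the loss is never zero for abs(e) > d
    a = (abs(e) <= d).float()
    b = (abs(e) > d).float()
    return a * e ** 2 / 2 + b * d * (abs(e) - d / 2)

# quick check: a large negative error no longer yields zero loss
print(huber_loss_fixed(torch.tensor([-5.0]), 1.0))  # tensor([4.5000]) instead of tensor([0.])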
Looking forward to hearing from you.
Are the choices of IPPO hyperparameters the same as MAPPO shown in Table 12? The only difference is the value of "use_centralized_V" (False for IPPO, True for MAPPO), right? Thanks!
I have trained MAPPO on simple_spread with separated policies, but the results are not stable.
Could the authors help me figure out whether this is a problem with the hyperparameter settings?
Could you provide the values of "--use_popart", "--use_valuenorm", "--use_value_active_masks", "--use_policy_active_masks" across all SMAC maps to help better reproduce your results? Thanks a lot!
I just read the original paper that proposed "PopArt". Its main idea is to keep the normalization statistics from affecting the learning process by modifying the W and b of the last layer, but your implementation is just a z-transformation for value normalization, which actually hurts training in my experiments. Maybe you should follow the original PopArt paper?
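To illustrate what I mean, here is a minimal sketch of a PopArt-style output layer (my own illustration of the original paper's idea, not a claim about this repo's code): when the running mean/std of the targets change, the last layer's W and b are rescaled so the unnormalized predictions seen by the rest of the network are preserved, whereas a plain z-transformation only rescales the targets.

import torch
import torch.nn as nn

class PopArtHead(nn.Module):
    def __init__(self, in_dim, beta=1e-2, eps=1e-4):
        super().__init__()
        self.fc = nn.Linear(in_dim, 1)          # last layer of the value network
        self.beta, self.eps = beta, eps
        self.register_buffer("mean", torch.zeros(1))
        self.register_buffer("mean_sq", torch.ones(1))

    @property
    def std(self):
        return (self.mean_sq - self.mean ** 2).clamp(min=self.eps).sqrt()

    @torch.no_grad()
    def update(self, targets):
        old_mean, old_std = self.mean.clone(), self.std.clone()
        self.mean.mul_(1 - self.beta).add_(self.beta * targets.mean())
        self.mean_sq.mul_(1 - self.beta).add_(self.beta * (targets ** 2).mean())
        # the "preserve outputs" step: rewrite W, b so that
        # old_std * y_old + old_mean == new_std * y_new + new_mean for every input
        self.fc.weight.data.mul_(old_std / self.std)
        self.fc.bias.data.mul_(old_std).add_(old_mean - self.mean).div_(self.std)

    def forward(self, x):
        return self.fc(x)                       # prediction in normalized space

    def denormalize(self, y):
        return y * self.std + self.mean

A plain value normalizer would keep only the running statistics and the normalize/denormalize calls, without touching W and b; that is the difference I am pointing at.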
on-policy/onpolicy/envs/starcraft2/StarCraft2_Env.py
Lines 560 to 577 in 0483adc
It should be bad_transition = True in line 563.
Hi authors,
Really appreciate this nice codebase for MAPPO. One issue I found is that the StarCraft environment used here is somewhat different from the one in SMAC. For example, compare how this environment and the one in SMAC implement the global state function:
The get_state function in SMAC returns the true global state, while the get_state function in your StarCraft environment still returns an agent-dependent state. I am wondering why the two environments differ.
I read the MAPPO paper and learned interesting insights from it. I have a few questions regarding the paper (and its previous version).
I saw the performance of QMIX on Hanabi reported in a previous version of the paper, which was very low (0.29 on the small version). I'm curious why it performed so badly. The results reported in the current version with VDN are much better, despite QMIX being a superset of VDN. I understand that the current paper uses the full version of the game; still, its score almost reaches the optimum, while QMIX (on the small version) doesn't seem to learn a meaningful policy. Is it possible to get performance similar to VDN using QMIX?
From what I understand, in MPE, evaluation is entangled with training (the code alternates between training and evaluation phases). Is there any way (or a script) to evaluate only a trained agent and save gifs/videos, etc.?
Hi @jxwuyi @eugenevinitsky @zoeyuchao @akashvelu
Thanks a lot for your great work! It helps a lot in carrying the performance of on-policy DRL algorithms over from single-agent to multi-agent cases.
I found that some files for the separated runner on specific environments (e.g. SMAC or MPE) are missing in the folder "onpolicy/runner/seperated/xxx", but I noticed that a base_runner is implemented there. So I was wondering whether the separated runner (perhaps plus the separated buffer) is a ready-to-go implementation with PPO optimization. If so, I can directly use it with my customized environment and run experiments, similarly to how the shared runner is used.
Look forward to your reply! Thanks!
Best,
Xubo
Both during data collection and evaluation, when an episode terminates, the model just takes the last obs of the corresponding env as input. But I think the envs should be reset when they reach termination. Or did I miss something in the code?
Hi @jxwuyi @eugenevinitsky @zoeyuchao @akashvelu
Thanks a lot for your work!
I wonder if you have the code for a separated runner for the SMAC task? It would be very helpful!
Thanks!
Xubo
Why throw away the reward of step 0 when collecting samples for Hanabi Environment?
def collect(self, step):
    for current_agent_id in range(self.num_agents):
        ...
        # rearrange reward
        # reward of step 0 will be thrown away.
        self.turn_rewards[choose, current_agent_id] = self.turn_rewards_since_last_action[choose, current_agent_id].copy()
        self.turn_rewards_since_last_action[choose, current_agent_id] = 0.0
        self.turn_rewards_since_last_action[choose] += rewards[choose]
        ...
If you let QMIX use 8 parallel processes, increase the batch size and/or the number of training epochs per update, and finally use TD(lambda) <= 0.5,
QMIX can beat all of these algorithms.
See our brief hyperparameter study: https://arxiv.org/abs/2102.03479
I have really found that in the MARL field, hyperparameter-tuning issues have led to a pile of wrong conclusions and experiments, and even wrong motivations, involving 10+ CCF-A top-conference papers.
The AAAI conference in particular has accepted papers whose proofs are wrong.
The Python output is listed here:
Traceback (most recent call last):
File "train/train_football.py", line 203, in
main(sys.argv[1:])
File "train/train_football.py", line 188, in main
runner.run()
File "/onpolicy/runner/shared/football_runner.py", line 43, in run
self.insert(data)
File "/onpolicy/runner/shared/football_runner.py", line 141, in insert
masks=masks
TypeError: insert() got an unexpected keyword argument 'rnn_states'
MAPPO works well in my environment: the reward increases and the critic loss decreases to convergence. However, my actor loss increases until convergence; shouldn't it decrease? Could you please explain this? Thank you.
Could you confirm whether I understood it right, according to #1:
You return a death mask that is used only by the policy to calculate advantages, the policy loss, and the entropy, plus you change the global state for the agent. And that's all? I tried a similar thing with IPPO and it didn't work well.
Could you point me to the feature-pruned vs. default env global state in the code?
Thanks,
Denys
# Normal
class FixedNormal(torch.distributions.Normal):
    def log_probs(self, actions):
        return super().log_prob(actions).sum(-1, keepdim=True)

    def entrop(self):
        return super.entropy().sum(-1)

    def mode(self):
        return self.mean
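If the snippet above is meant to point out typos, a corrected version would presumably look like this (my own sketch; entrop renamed to entropy and super called as super()):

import torch

class FixedNormal(torch.distributions.Normal):
    def log_probs(self, actions):
        # sum log-probs over action dimensions, keeping a trailing dim of 1
        return super().log_prob(actions).sum(-1, keepdim=True)

    def entropy(self):
        # the original `super.entropy()` (without parentheses) would raise an AttributeError
        return super().entropy().sum(-1)

    def mode(self):
        return self.mean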
I hope to visualize the average_episode_rewards, but in this folder I can only see a file named "events.out.tfevents.xxx.amax". How can I open this file? Is there something I should change? I would appreciate it if you could tell me.
I also hope you can tell me how to open a .wandb file.
When reproducing the project, the last step ./train_mpe.sh gives an error when requesting link data from the project owner. How can I solve it?
Hi~ When I try to run the speaker_listener env, I find the base_runner.py file is missing. Could you provide this file?
Thanks!
Hello! I found that in the new version of the MAPPO paper, you use a concatenation of the default environment global state and all agents' local observations as the mixer network input. But why don't you instead concatenate the feature-pruned agent-specific global states used by MAPPO to build the mixer network input? Isn't this unfair for the comparison?
Hello!
I checked your paper and found in Section 4.5 that "death masking" simply uses an agent-specific constant vector, i.e., a zero vector with the agent's ID, as the input to the value function after an agent dies. However, I can't find the corresponding code in this project. You seem to apply the "active mask" to the entropy term, not to the state.
Am I missing something important in this project?
In the config.py file, there is this env parameter:
parser.add_argument("--use_obs_instead_of_state", action='store_true',
                    default=False, help="Whether to use global state or concatenated obs")
I would like to use a global state in my code. However, I don't understand how the aforementioned parameter is being used, since the base_runner is the only script that extracts it (and then doesn't use it anyway). Thanks!
Hi
Is it possible to get access to the 4 MAPPO-trained actor and critic models for Hanabi that you refer to in the paper?
Many thanks!
Hi
The code works on CPU but I get the problem below on GPU:
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Full details are below:
env is Hanabi, algo is mappo, exp is mlp_critic1e-3_entropy0.015_v0belief, max seed is 1
seed is 1:
choose to use gpu...
Traceback (most recent call last):
File "train/train_hanabi_forward.py", line 176, in <module>
main(sys.argv[1:])
File "train/train_hanabi_forward.py", line 161, in main
runner.run()
File "/nfs/home/ic/on-policy/onpolicy/runner/shared/hanabi_runner_forward.py", line 50, in run
self.collect(step)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
return func(*args, **kwargs)
File "/nfs/home/ic/on-policy/onpolicy/runner/shared/hanabi_runner_forward.py", line 153, in collect
self.use_available_actions[choose])
File "/nfs/home/ic/on-policy/onpolicy/algorithms/r_mappo/algorithm/rMAPPOPolicy.py", line 71, in get_actions
deterministic)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/nfs/home/ic/on-policy/onpolicy/algorithms/r_mappo/algorithm/r_actor_critic.py", line 62, in forward
actor_features = self.base(obs)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/nfs/home/ic/on-policy/onpolicy/algorithms/utils/mlp.py", line 54, in forward
x = self.mlp(x)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/nfs/home/ic/on-policy/onpolicy/algorithms/utils/mlp.py", line 25, in forward
x = self.fc1(x)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/nfs/home/ic/miniconda3/envs/marl/lib/python3.7/site-packages/torch/nn/functional.py", line 1610, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
training is done!
I get the error with both cuda 10.1 and 11.2. Do you have any thoughts on how to fix this? Many thanks!
I met the error: wandb: ERROR Error while calling W&B API: project not found (<Response [404]>)
I have changed the --user-name in config.py to mine, but it still doesn't work.
Thank you for your reply.
Hello,
Thanks for the code base; it is indeed really good work!
I was trying to replicate the results from the paper and started with the MPE environment with mappo and rmappo.
When I train rmappo on MPE, I get the following error:
env is MPE, scenario is simple_spread, algo is mappo, exp is check, max seed is 1
seed is 1:
choose to use gpu...
Traceback (most recent call last):
File "train/train_mpe.py", line 167, in <module>
main(sys.argv[1:])
File "train/train_mpe.py", line 152, in main
runner.run()
File "/home/kailash/Desktop/on-policy/onpolicy/runner/shared/mpe_runner.py", line 47, in run
self.save()
File "/home/kailash/Desktop/on-policy/onpolicy/runner/shared/base_runner.py", line 134, in save
policy_vnorm = self.trainer.policy.value_normalizer
AttributeError: 'R_MAPPOPolicy' object has no attribute 'value_normalizer'
When I train mappo on MPE, I get the following error:
env is MPE, scenario is simple_spread, algo is mappo, exp is check, max seed is 1
seed is 1:
Traceback (most recent call last):
File "train/train_mpe.py", line 167, in <module>
main(sys.argv[1:])
File "train/train_mpe.py", line 71, in main
assert (all_args.use_recurrent_policy == False and all_args.use_naive_recurrent_policy == False), ("check recurrent policy!")
AssertionError: check recurrent policy!
Please let me know if I made any mistake. Thanks for the help!
I have run your code:
./train_smac.sh
However, the following error occurred:
env is StarCraft2, map is corridor, algo is mappo, exp is mlp, max seed is 1
seed is 1:
Traceback (most recent call last):
File "train/train_smac.py", line 175, in <module>
main(sys.argv[1:])
File "train/train_smac.py", line 82, in main
"check recurrent policy!")
AssertionError: check recurrent policy!
I changed the argument "algo" in the file "train_smac.sh" to
algo="rmappo"
I don't know whether this modification is appropriate. It did run, but the results were not satisfactory.
Do you have good parameters for training?
This is a bug that I fixed by looking at the other on-policy code that you have. In shared/mpe_runner.py, in def eval(), almost at the end of the function:
eval_episode_rewards = np.array(eval_episode_rewards)
eval_env_infos = {}
eval_env_infos['eval_average_episode_rewards'] = np.sum(np.array(eval_episode_rewards), axis=0)
# print("eval average episode rewards of agent: " + str(eval_average_episode_rewards))
The variable eval_average_episode_rewards is not defined, and the code exits with an error. Instead I used:
print("eval average episode rewards of agent: " + str(np.mean(eval_env_infos['eval_average_episode_rewards'])))
This is different from separated/mpe_runner.py:
eval_train_infos = []
for agent_id in range(self.num_agents):
    eval_average_episode_rewards = np.mean(np.sum(eval_episode_rewards[:, :, agent_id], axis=0))
    eval_train_infos.append({'eval_average_episode_rewards': eval_average_episode_rewards})
    print("eval average episode rewards of agent%i: " % agent_id + str(eval_average_episode_rewards))
But I guess the logic is that in the shared case agent1 and agent2 are the same, so averaging the reward over their performance is reasonable.
def train(self):
    train_infos = []
    for agent_id in range(self.num_agents):
        self.trainer[agent_id].prep_training()
        train_info = self.trainer[agent_id].train(self.buffer[agent_id])
        train_infos.append(train_info)
        self.buffer[agent_id].after_update()
    return train_infos
Why is after_update() called after train()?
Hi,
Thanks for your contribution to the community. I directly ran the given example on MPE (simple_spread) with the default parameters, except for changing n_rollout_threads to 8 due to compute limits. However, the results on W&B only reach around -170, which suggests it is not working.
Has anybody met this problem? Could you help me fix it?
Thanks!
When running the train_mpe file, it prompts for W&B. I chose not to use visual results, but it then errors out. Does this program have to use W&B? I have always used TensorBoard and have never used W&B, so I don't know how to use it. If W&B is required, could you add a TensorBoard option? Thank you!
Nice paper and project! We are also doing research on this topic.
We find something confusing in the original MAPPO paper: why do the shaded regions of the following figure on page 7 go above a win rate of 1.0? What metric do you use in this figure? Could you give a more detailed explanation? Thank you!
Has anyone had a problem similar to mine? I ran this code on the Colab platform.
/content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts
env is MPE, scenario is simple_spread, algo is rmappo, exp is check, max seed is 1
seed is 1:
choose to use cpu...
wandb: Currently logged in as: yi-li (use wandb login --relogin to force relogin)
wandb: Tracking run with wandb version 0.12.11
wandb: Run data is saved locally in /content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts/results/MPE/simple_spread/rmappo/check/wandb/run-20220303_092843-1t4tb6nn
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run rmappo_check_seed1
wandb: ⭐️ View project at https://wandb.ai/yi-li/MPE
wandb: 🚀 View run at https://wandb.ai/yi-li/MPE/runs/1t4tb6nn
Exception in thread NetStatThr:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 149, in check_network_status
status_response = self._interface.communicate_network_status()
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 125, in communicate_network_status
resp = self._communicate_network_status(status)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 397, in _communicate_network_status
resp = self._communicate(req, local=True)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 222, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 227, in _communicate_async
raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
Exception in thread ChkStopThr:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 167, in check_status
status_response = self._interface.communicate_stop_status()
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 114, in communicate_stop_status
resp = self._communicate_stop_status(status)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 387, in _communicate_stop_status
resp = self._communicate(req, local=True)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 222, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 227, in _communicate_async
raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
and this is my debug.log:
2022-03-03 09:28:43,805 INFO MainThread:4971 [wandb_setup.py:_flush():75] Loading settings from /root/.config/wandb/settings
2022-03-03 09:28:43,806 INFO MainThread:4971 [wandb_setup.py:_flush():75] Loading settings from /content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts/wandb/settings
2022-03-03 09:28:43,806 INFO MainThread:4971 [wandb_setup.py:_flush():75] Loading settings from environment variables: {}
2022-03-03 09:28:43,806 INFO MainThread:4971 [wandb_setup.py:_flush():75] Inferring run settings from compute environment: {'program_relpath': 'train/train_mpe.py', 'program': 'train/train_mpe.py'}
2022-03-03 09:28:43,808 INFO MainThread:4971 [wandb_init.py:_log_setup():405] Logging user logs to /content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts/results/MPE/simple_spread/rmappo/check/wandb/run-20220303_092843-1t4tb6nn/logs/debug.log
2022-03-03 09:28:43,808 INFO MainThread:4971 [wandb_init.py:_log_setup():406] Logging internal logs to /content/drive/MyDrive/MAPPO/on-policy-main/onpolicy/scripts/results/MPE/simple_spread/rmappo/check/wandb/run-20220303_092843-1t4tb6nn/logs/debug-internal.log
2022-03-03 09:28:43,809 INFO MainThread:4971 [wandb_init.py:init():439] calling init triggers
2022-03-03 09:28:43,809 INFO MainThread:4971 [wandb_init.py:init():443] wandb.init called with sweep_config: {}
config: {'algorithm_name': 'rmappo', 'experiment_name': 'check', 'seed': 1, 'cuda': True, 'cuda_deterministic': True, 'n_training_threads': 1, 'n_rollout_threads': 128, 'n_eval_rollout_threads': 1, 'n_render_rollout_threads': 1, 'num_env_steps': 20000000, 'user_name': 'yi-li', 'use_wandb': True, 'env_name': 'MPE', 'use_obs_instead_of_state': False, 'episode_length': 25, 'share_policy': True, 'use_centralized_V': True, 'stacked_frames': 1, 'use_stacked_frames': False, 'hidden_size': 64, 'layer_N': 1, 'use_ReLU': False, 'use_popart': False, 'use_valuenorm': True, 'use_feature_normalization': True, 'use_orthogonal': True, 'gain': 0.01, 'use_naive_recurrent_policy': False, 'use_recurrent_policy': True, 'recurrent_N': 1, 'data_chunk_length': 10, 'lr': 0.0007, 'critic_lr': 0.0007, 'opti_eps': 1e-05, 'weight_decay': 0, 'ppo_epoch': 10, 'use_clipped_value_loss': True, 'clip_param': 0.2, 'num_mini_batch': 1, 'entropy_coef': 0.01, 'value_loss_coef': 1, 'use_max_grad_norm': True, 'max_grad_norm': 10.0, 'use_gae': True, 'gamma': 0.99, 'gae_lambda': 0.95, 'use_proper_time_limits': False, 'use_huber_loss': True, 'use_value_active_masks': True, 'use_policy_active_masks': True, 'huber_delta': 10.0, 'use_linear_lr_decay': False, 'save_interval': 1, 'log_interval': 5, 'use_eval': False, 'eval_interval': 25, 'eval_episodes': 32, 'save_gifs': False, 'use_render': False, 'render_episodes': 5, 'ifi': 0.1, 'model_dir': None, 'scenario_name': 'simple_spread', 'num_landmarks': 3, 'num_agents': 3}
2022-03-03 09:28:43,809 INFO MainThread:4971 [wandb_init.py:init():492] starting backend
2022-03-03 09:28:43,809 INFO MainThread:4971 [backend.py:_multiprocessing_setup():101] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2022-03-03 09:28:43,831 INFO MainThread:4971 [backend.py:ensure_launched():219] starting backend process...
2022-03-03 09:28:43,840 INFO MainThread:4971 [backend.py:ensure_launched():225] started backend process with pid: 4987
2022-03-03 09:28:43,851 INFO MainThread:4971 [wandb_init.py:init():501] backend started and connected
2022-03-03 09:28:43,867 INFO MainThread:4971 [wandb_init.py:init():565] updated telemetry
2022-03-03 09:28:43,873 INFO MainThread:4971 [wandb_init.py:init():596] communicating run to backend with 30 second timeout
2022-03-03 09:28:45,203 INFO MainThread:4971 [wandb_run.py:_on_init():1759] communicating current version
2022-03-03 09:28:45,250 INFO MainThread:4971 [wandb_run.py:_on_init():1763] got version response
2022-03-03 09:28:45,251 INFO MainThread:4971 [wandb_init.py:init():625] starting run threads in backend
2022-03-03 09:28:49,989 INFO MainThread:4971 [wandb_run.py:_console_start():1733] atexit reg
2022-03-03 09:28:49,993 INFO MainThread:4971 [wandb_run.py:_redirect():1606] redirect: SettingsConsole.REDIRECT
2022-03-03 09:28:49,994 INFO MainThread:4971 [wandb_run.py:_redirect():1611] Redirecting console.
2022-03-03 09:28:50,000 INFO MainThread:4971 [wandb_run.py:_redirect():1667] Redirects installed.
2022-03-03 09:28:50,001 INFO MainThread:4971 [wandb_init.py:init():664] run started, returning control to user process
Running train_mpe, the code reports an error at line 63 of the file "onpolicy/algorithms/utils/popart.py":
self.stddev = (self.mean_sq - self.mean ** 2).sqrt().clamp(min=1e-4)
self.weight = self.weight * old_stddev / self.stddev
self.bias = (old_stddev * self.bias + old_mean - self.mean) / self.stddev
File "/on-policy-main/onpolicy/algorithms/utils/popart.py", line 63, in update
self.stddev = (self.mean_sq - self.mean ** 2).sqrt().clamp(min=1e-4)
File "/home/cc/anaconda3/envs/MAPPO/lib/python3.6/site-packages/torch/nn/modules/module.py", line 613, in setattr
.format(torch.typename(value), name))
TypeError: cannot assign 'torch.FloatTensor' as parameter 'stddev' (torch.nn.Parameter or None expected)
Should the above code be changed to the following?
self.stddev = nn.Parameter((self.mean_sq - self.mean ** 2).sqrt().clamp(min=1e-4))
self.weight = nn.Parameter(self.weight * old_stddev / self.stddev)
self.bias = nn.Parameter((old_stddev * self.bias + old_mean - self.mean) / self.stddev)
In r_mappo.py, line 175
if self._use_popart:
    advantages = buffer.returns[:-1] - self.value_normalizer.denormalize(buffer.value_preds[:-1])
else:
    advantages = buffer.returns[:-1] - buffer.value_preds[:-1]
should be fixed to
if self._use_popart or self._use_valuenorm:
    advantages = buffer.returns[:-1] - self.value_normalizer.denormalize(buffer.value_preds[:-1])
else:
    advantages = buffer.returns[:-1] - buffer.value_preds[:-1]
This error collapses the performance of PPO.
Other notes:
The paper mentions the use of PopArt, but it is not actually used in the code.
Finally, MAPPO (centralized V) is mentioned in the abstract of the paper, but it is actually IPPO with global information, because the value function is not centralized.
Thanks.
Hello, I want to run the MPE speaker_listener experiment with 7 agents in MAPPO, but the program tells me only two agents are supported. Does this code not support speaker_listener experiments with 7 agents?