Trying to run with 24Gb VRAM about calm HOT 8 CLOSED

nvlabs commented on August 22, 2024

Trying to run with 24Gb VRAM

from calm.

Comments (8)

mxsage commented on August 22, 2024

I was able to make it work with 20159MiB of VRAM using:

1024 environments
4096 minibatch_size
1024 amp_minibatch_size
50000 buffer sizes

If you think a different trade-off would be more effective, I'm eager to hear!
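As a quick sanity check on settings like these, here is a minimal sketch that verifies the per-epoch rollout batch splits evenly into minibatches. It assumes the trainer follows the common PPO convention where the batch is `num_envs * horizon_length` and must be divisible by `minibatch_size`; the horizon length of 32 is an assumption, not a value stated in this thread, and `check_batch_split` is a hypothetical helper.

```python
def check_batch_split(num_envs, horizon_length, minibatch_size):
    """Return the number of minibatches per epoch, or raise if the split is uneven."""
    batch_size = num_envs * horizon_length
    if batch_size % minibatch_size != 0:
        raise ValueError(
            f"batch of {batch_size} samples is not divisible by "
            f"minibatch_size={minibatch_size}"
        )
    return batch_size // minibatch_size


# Assumed rollout length per environment; check the value in your own config.
HORIZON_LENGTH = 32

# Settings reported above: 1024 environments, minibatch_size 4096.
print(check_batch_split(1024, HORIZON_LENGTH, 4096))  # 8 under the assumed horizon
```

If the check raises, adjusting either the environment count or the minibatch size to a power of two usually restores an even split.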

It probably will be pretty slow to train (AMP base model was ~1 week as I remember). Are you able to share a rough estimate of how many GPU-hours it took to train the models described in the paper so I can do some estimates? Thank you for your time!


tesslerc commented on August 22, 2024

I'd try the following:

In https://github.com/NVlabs/CALM/blob/main/calm/data/cfg/humanoid_calm_sword_shield_getup.yaml, reduce the number of envs. Try 2k.
In https://github.com/NVlabs/CALM/blob/main/calm/data/cfg/train/rlg/calm_humanoid.yaml, reduce the minibatch sizes (8k for general and 2k for AMP) and the replay buffer sizes (50k).
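The reductions suggested above can be sketched as overrides applied to the loaded configs. This is only an illustration: the key names (`env.numEnvs`, `params.config.minibatch_size`, and so on) and the starting values are assumptions standing in for whatever the two YAML files actually contain, "2k" and "8k" are read here as 2048 and 8192, and `with_overrides` is a hypothetical helper, not part of CALM.

```python
import copy


def with_overrides(cfg, overrides):
    """Return a deep copy of cfg with dotted-path overrides applied."""
    out = copy.deepcopy(cfg)
    for path, value in overrides.items():
        *parents, leaf = path.split(".")
        node = out
        for key in parents:
            node = node[key]  # raises KeyError if the path does not exist
        if leaf not in node:
            raise KeyError(path)  # refuse to silently add misspelled keys
        node[leaf] = value
    return out


# Placeholder configs: key names and values are assumptions, not repo defaults.
env_cfg = {"env": {"numEnvs": 4096}}
train_cfg = {
    "params": {
        "config": {
            "minibatch_size": 32768,
            "amp_minibatch_size": 4096,
            "amp_replay_buffer_size": 200000,
        }
    }
}

small_env = with_overrides(env_cfg, {"env.numEnvs": 2048})
small_train = with_overrides(
    train_cfg,
    {
        "params.config.minibatch_size": 8192,
        "params.config.amp_minibatch_size": 2048,
        "params.config.amp_replay_buffer_size": 50000,
    },
)
print(small_env)
print(small_train)
```

Rejecting unknown keys is a deliberate choice here: a typo in an override would otherwise leave the original, memory-hungry value in place without any warning.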


tesslerc commented on August 22, 2024

Our experiments were on a slightly larger scale, so I am not sure. But it feels like you're heading in the right direction.

We trained it for 2 weeks on an A100 (larger scale).
If AMP took you one week, I assume this will take a while, but we may be surprised and it may converge faster than expected with these settings.
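For a rough budget, the figures in this thread can be converted to GPU-hours. This sketch assumes a single GPU running continuously (the reply says "an A100") and treats a week as 7 full days; `gpu_hours` is a hypothetical helper, and the result says nothing about how long a smaller configuration on different hardware would take.

```python
def gpu_hours(days, num_gpus=1):
    """Wall-clock days of continuous training converted to GPU-hours."""
    return days * 24 * num_gpus


# 2 weeks on one A100, as reported for the larger-scale run.
print(gpu_hours(14))  # 336
# ~1 week for the AMP base model, as recalled by the asker.
print(gpu_hours(7))  # 168
```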


mxsage commented on August 22, 2024

I see. I was able to train the LLC enough so that the agent can successfully get up and emulate motions! So progress. However, I am getting an error when I try to run the precision training step, feeding in my just-trained LLC:

AttributeError: 'HumanoidHeadingConditioned' object has no attribute '_tar_motion_index'

Full output attached. Thanks for taking a look!
precision-training-error.txt


tesslerc commented on August 22, 2024

Thanks for testing and helping find these issues.
I have pushed a fix.
283ff26
The issue was a mismatch in variable names that occurred during public release.


mxsage commented on August 22, 2024

Hm, I'm still getting a similar error:

=> loading checkpoint 'output/Humanoid_22-14-39-36/nn/Humanoid.pth'
Loaded LLC checkpoint from output/Humanoid_22-14-39-36/nn/Humanoid.pth
Traceback (most recent call last):
  File "calm/run.py", line 274, in <module>
    main()
  File "calm/run.py", line 268, in main
    runner.run(vargs)
  File "/home/username/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 139, in run
    self.run_train()
  File "/home/username/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 125, in run_train
    agent.train()
  File "/home/username/dev/_CALM/calm/learning/common_agent.py", line 120, in train
    train_info = self.train_epoch()
  File "/home/username/dev/_CALM/calm/learning/common_agent.py", line 192, in train_epoch
    batch_dict = self.play_steps() 
  File "/home/username/dev/_CALM/calm/learning/hrl_agent.py", line 137, in play_steps
    self.obs, rewards, self.dones, infos = self.env_step(res_dict['actions'])
  File "/home/username/dev/_CALM/calm/learning/hrl_agent.py", line 81, in env_step
    curr_disc_reward = self._calc_disc_reward(amp_obs)
  File "/home/username/dev/_CALM/calm/learning/hrl_conditioned_agent.py", line 72, in _calc_disc_reward
    requested_motion_indices = self.vec_env.env.task.tar_locomotion_index.view(-1)
AttributeError: 'HumanoidHeadingConditioned' object has no attribute 'tar_locomotion_index'

Additionally, I opened a pull request fixing a few path-related issues. Thanks for taking a look!
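When an `AttributeError` like the one above comes from a renamed variable, listing similarly named attributes on the object can help locate the mismatch. The following is a minimal sketch using the standard library's `difflib`; `suggest_attrs` is a hypothetical helper, and `FakeTask` with its attribute name is an invented stand-in, since the thread does not state what the attribute is actually called in CALM.

```python
import difflib


def suggest_attrs(obj, missing, cutoff=0.5):
    """Return up to five attribute names on obj that resemble the missing one."""
    names = [n for n in dir(obj) if not n.startswith("__")]
    return difflib.get_close_matches(missing, names, n=5, cutoff=cutoff)


class FakeTask:
    """Invented stand-in for the task object; the attribute name is hypothetical."""

    def __init__(self):
        self._tar_locomotion_index = None


task = FakeTask()
try:
    task.tar_locomotion_index
except AttributeError:
    # Show candidates that may be the intended name.
    print(suggest_attrs(task, "tar_locomotion_index"))
```

In a real session the same call could be made on `self.vec_env.env.task` from a debugger at the failing line in the traceback.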


tesslerc commented on August 22, 2024

Thank you. The pull request looks good.
Seems like there were some mistakes when cleaning up the repo for public upload.

I have pushed the relevant fixes (including small changes to the readme). I verified on a clean installation that this now works -- precision training & running the FSM (inference).

There are trained models in the calm/data/models folder if you're interested in testing things without fully training the models yourself. You can download them using git lfs.

Please let me know if you encounter any further issues.


mxsage commented on August 22, 2024

Thanks for the fixes! Looking forward to testing more / training on separate motion capture sequences with different topology (no sword/shield). Will get back to you if any more issues arise.
