
Comments (8)

mxsage commented on August 22, 2024

I was able to make it work with 20159MiB of VRAM using:

1024 environments
4096 minibatch_size
1024 amp_minibatch_size
50000 buffer sizes

If you think a different trade-off would be more effective, I'm eager to hear it!

It will probably be pretty slow to train (the AMP base model took ~1 week, as I remember). Are you able to share a rough estimate of how many GPU-hours it took to train the models described in the paper so I can make some estimates? Thank you for your time!

from calm.

tesslerc commented on August 22, 2024

I'd try the following:

  1. In https://github.com/NVlabs/CALM/blob/main/calm/data/cfg/humanoid_calm_sword_shield_getup.yaml, reduce the number of envs. Try 2k.
  2. In https://github.com/NVlabs/CALM/blob/main/calm/data/cfg/train/rlg/calm_humanoid.yaml, reduce the minibatch sizes (8k for general and 2k for AMP) and the replay buffer sizes (50k).
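The two suggestions above amount to lowering a handful of values in the two YAML files. Below is a minimal sketch of applying such overrides in Python, using plain dictionaries as stand-ins for the parsed YAML. The key names (`numEnvs`, `minibatch_size`, `amp_minibatch_size`, `amp_replay_buffer_size`), their nesting, and the "before" values are assumptions and should be checked against the actual files. The helper `apply_overrides` is hypothetical and not part of CALM, and "2k"/"8k" are read here as 2048/8192.

```python
import copy


def apply_overrides(cfg, overrides):
    """Return a copy of cfg with dotted-path overrides applied.

    Raises KeyError if a path does not already exist, so a typo in a key
    name fails loudly instead of silently adding an unused setting.
    """
    out = copy.deepcopy(cfg)
    for path, value in overrides.items():
        node = out
        *parents, leaf = path.split(".")
        for key in parents:
            node = node[key]  # KeyError here if a parent key is missing
        if leaf not in node:
            raise KeyError(path)
        node[leaf] = value
    return out


# Stand-ins for the parsed YAML files. Key names and nesting are assumed;
# the "before" values are placeholders, not verified repository defaults.
env_cfg = {"env": {"numEnvs": 4096}}
train_cfg = {"params": {"config": {
    "minibatch_size": 32768,
    "amp_minibatch_size": 4096,
    "amp_replay_buffer_size": 200000,
}}}

# Values suggested in the thread: 2k envs, 8k / 2k minibatches, 50k buffer.
small_env_cfg = apply_overrides(env_cfg, {"env.numEnvs": 2048})
small_train_cfg = apply_overrides(train_cfg, {
    "params.config.minibatch_size": 8192,
    "params.config.amp_minibatch_size": 2048,
    "params.config.amp_replay_buffer_size": 50000,
})

print(small_env_cfg)
print(small_train_cfg)
```

Failing on unknown keys is a deliberate choice in this sketch: a misspelled setting would otherwise be ignored by the trainer and the memory footprint would not change.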


tesslerc commented on August 22, 2024

Our experiments were on a slightly larger scale, so I am not sure, but it feels like you're heading in the right direction.

We trained it for 2 weeks on an A100 (larger scale).
If AMP took you one week, I assume this will take a while, but we may be surprised and it may converge faster than expected with these settings.
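For a rough sense of the numbers mentioned in this thread, the wall-clock figures can be converted to GPU-hours. This is only back-of-the-envelope arithmetic assuming a single GPU running continuously. The two runs used different hardware and settings, so the results are not directly comparable, and `gpu_hours` is a hypothetical helper.

```python
HOURS_PER_DAY = 24


def gpu_hours(days, num_gpus=1):
    """Convert wall-clock days of continuous training to GPU-hours."""
    return days * HOURS_PER_DAY * num_gpus


calm_hours = gpu_hours(14)  # "2 weeks on an A100"
amp_hours = gpu_hours(7)    # "~1 week" reported for the AMP base model

print(calm_hours, amp_hours)
```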


mxsage commented on August 22, 2024

I see. I was able to train the LLC enough so that the agent can successfully get up and emulate motions! So progress. However, I am getting an error when I try to run the precision training step, feeding in my just-trained LLC:

AttributeError: 'HumanoidHeadingConditioned' object has no attribute '_tar_motion_index'

Full output attached. Thanks for taking a look!
precision-training-error.txt


tesslerc commented on August 22, 2024

Thanks for testing and helping find these issues.
I have pushed a fix (283ff26).
The issue was a mismatch in variable names introduced during the public release.


mxsage commented on August 22, 2024

Hm, I'm still getting a similar error:

=> loading checkpoint 'output/Humanoid_22-14-39-36/nn/Humanoid.pth'
Loaded LLC checkpoint from output/Humanoid_22-14-39-36/nn/Humanoid.pth
Traceback (most recent call last):
  File "calm/run.py", line 274, in <module>
    main()
  File "calm/run.py", line 268, in main
    runner.run(vargs)
  File "/home/username/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 139, in run
    self.run_train()
  File "/home/username/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 125, in run_train
    agent.train()
  File "/home/username/dev/_CALM/calm/learning/common_agent.py", line 120, in train
    train_info = self.train_epoch()
  File "/home/username/dev/_CALM/calm/learning/common_agent.py", line 192, in train_epoch
    batch_dict = self.play_steps() 
  File "/home/username/dev/_CALM/calm/learning/hrl_agent.py", line 137, in play_steps
    self.obs, rewards, self.dones, infos = self.env_step(res_dict['actions'])
  File "/home/username/dev/_CALM/calm/learning/hrl_agent.py", line 81, in env_step
    curr_disc_reward = self._calc_disc_reward(amp_obs)
  File "/home/username/dev/_CALM/calm/learning/hrl_conditioned_agent.py", line 72, in _calc_disc_reward
    requested_motion_indices = self.vec_env.env.task.tar_locomotion_index.view(-1)
AttributeError: 'HumanoidHeadingConditioned' object has no attribute 'tar_locomotion_index'

Additionally, I opened a pull request fixing a few path-related issues. Thanks for taking a look!
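When an `AttributeError` like the one above comes from a renamed variable, listing attributes with similar names on the object can help locate the mismatch. The following is a hypothetical debugging sketch, not the fix that was pushed to the repository. `find_similar_attrs` and `FakeTask` are invented for illustration, and the attribute name on `FakeTask` is made up rather than taken from the CALM code.

```python
import difflib


def find_similar_attrs(obj, missing_name, cutoff=0.5):
    """List attribute names on obj that resemble missing_name."""
    names = [n for n in dir(obj) if not n.startswith("__")]
    return difflib.get_close_matches(missing_name, names, n=5, cutoff=cutoff)


class FakeTask:
    """Stand-in for the task object; the attribute name is invented."""

    def __init__(self):
        self._tar_locomotion_index = [0, 1, 2]


task = FakeTask()
# The agent code asked for 'tar_locomotion_index'; see what exists instead.
candidates = find_similar_attrs(task, "tar_locomotion_index")
print(candidates)
```

In a real session the same call could be made on `self.vec_env.env.task` from the traceback, just before the failing line.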


tesslerc commented on August 22, 2024

Thank you. The pull request looks good.
Seems like there were some mistakes when cleaning up the repo for public upload.

I have pushed the relevant fixes (including small changes to the readme). I verified on a clean installation that this now works -- precision training & running the FSM (inference).

There are trained models in the calm/data/models folder if you're interested in testing things without fully training the models yourself. You can download them using git lfs.

Please let me know if you encounter any further issues.


mxsage commented on August 22, 2024

Thanks for the fixes! Looking forward to testing more / training on separate motion capture sequences with different topology (no sword/shield). Will get back to you if any more issues arise.

