Comments (5)

stoneyang commented on September 10, 2024

distributed_world_size should be set to the number of GPUs to use, which should be 8 here.

Uh, I've tried that configuration before. When distributed_world_size is set to anything other than 1, the following error is raised:

[2022-02-16 14:33:42,664][fairseq.trainer][INFO] - detected shared parameter: feature_extractor_video.resnet.frontend3D.0.bias <- feature_extractor_video.resnet.trunk.layer4.1.conv2.bias
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-hydra-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-hydra-train')())
  File "/home/Code/av_hubert/fairseq/fairseq_cli/hydra_train.py", line 76, in cli_main
    hydra_main()
  File "/usr/local/lib/python3.8/site-packages/hydra/main.py", line 32, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 346, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 347, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 107, in run
    return run_job(
  File "/usr/local/lib/python3.8/site-packages/hydra/core/utils.py", line 129, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/Code/av_hubert/fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
  File "/home/Code/av_hubert/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/home/Code/av_hubert/fairseq/fairseq_cli/train.py", line 138, in main
    trainer = Trainer(cfg, task, model, criterion, quantizer)
  File "/home/Code/av_hubert/fairseq/fairseq/trainer.py", line 148, in __init__
    if self.data_parallel_rank == 0:
  File "/home/Code/av_hubert/fairseq/fairseq/trainer.py", line 181, in data_parallel_rank
    return distributed_utils.get_data_parallel_rank()
  File "/home/Code/av_hubert/fairseq/fairseq/distributed/utils.py", line 463, in get_data_parallel_rank
    return get_rank(get_data_parallel_group())
  File "/home/Code/av_hubert/fairseq/fairseq/distributed/utils.py", line 405, in get_rank
    return dist.get_rank(group=group)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 688, in get_rank
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

After reading this, it turns out distributed_port should be removed in the single-machine setting. Training now starts successfully on all GPUs!
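
For anyone hitting the same issue, here is a minimal sketch of what the working single-machine distributed_training section might look like. It is based on the base_lrs3_iter1.yaml snippet quoted later in this thread; the 8-GPU count and exact values are assumptions, not the authors' reference config.

distributed_training:
  ddp_backend: no_c10d
  distributed_backend: 'nccl'
  distributed_world_size: 8   # assumed: number of GPUs on this machine
  nprocs_per_node: 8
  # distributed_port intentionally omitted for single-machine training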

stoneyang commented on September 10, 2024

Going by the suggestion provided here, this problem seems rather odd ....

chevalierNoir commented on September 10, 2024

Hi,

What distributed_world_size and nprocs_per_node did you set in the config?

stoneyang commented on September 10, 2024

Hi,

What distributed_world_size and nprocs_per_node did you set in the config?

In av_hubert/avhubert/conf/pretrain/base_lrs3_iter1.yaml, the two options are set as follows:

distributed_training:
  ddp_backend: no_c10d
  distributed_backend: 'nccl'
  distributed_world_size: 1
  distributed_port: 29671
  nprocs_per_node: 8

The only difference is that distributed_world_size was changed from 32 to 1.

chevalierNoir commented on September 10, 2024

distributed_world_size should be set to the number of GPUs to use, which should be 8 here.
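
If editing the yaml directly is inconvenient, the value can presumably also be overridden on the command line using the usual hydra override syntax (the config path below is a placeholder, and any task/data overrides your run needs still have to be supplied):

fairseq-hydra-train \
  --config-dir /path/to/av_hubert/avhubert/conf/pretrain \
  --config-name base_lrs3_iter1.yaml \
  distributed_training.distributed_world_size=8 \
  distributed_training.nprocs_per_node=8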
