Comments (5)

stoneyang commented on September 10, 2024

distributed_world_size should be set to the number of GPUs to use, which should be 8 here.

Uh, I've tried that configuration before. When distributed_world_size is set to anything other than 1, the following error is raised:

[2022-02-16 14:33:42,664][fairseq.trainer][INFO] - detected shared parameter: feature_extractor_video.resnet.frontend3D.0.bias <- feature_extractor_video.resnet.trunk.layer4.1.conv2.bias
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-hydra-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-hydra-train')())
  File "/home/Code/av_hubert/fairseq/fairseq_cli/hydra_train.py", line 76, in cli_main
    hydra_main()
  File "/usr/local/lib/python3.8/site-packages/hydra/main.py", line 32, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 346, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 347, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 107, in run
    return run_job(
  File "/usr/local/lib/python3.8/site-packages/hydra/core/utils.py", line 129, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/Code/av_hubert/fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
  File "/home/Code/av_hubert/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/home/Code/av_hubert/fairseq/fairseq_cli/train.py", line 138, in main
    trainer = Trainer(cfg, task, model, criterion, quantizer)
  File "/home/Code/av_hubert/fairseq/fairseq/trainer.py", line 148, in __init__
    if self.data_parallel_rank == 0:
  File "/home/Code/av_hubert/fairseq/fairseq/trainer.py", line 181, in data_parallel_rank
    return distributed_utils.get_data_parallel_rank()
  File "/home/Code/av_hubert/fairseq/fairseq/distributed/utils.py", line 463, in get_data_parallel_rank
    return get_rank(get_data_parallel_group())
  File "/home/Code/av_hubert/fairseq/fairseq/distributed/utils.py", line 405, in get_rank
    return dist.get_rank(group=group)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 688, in get_rank
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

After reading this, it turns out distributed_port should be removed in the single-machine setting. Training now starts successfully on all GPUs!
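
For anyone hitting the same issue, here is a minimal sketch of what the working single-machine distributed_training section might look like. It is based on the base_lrs3_iter1.yaml snippet quoted later in this thread; the 8-GPU count and exact values are assumptions, not the authors' reference config.

distributed_training:
  ddp_backend: no_c10d
  distributed_backend: 'nccl'
  distributed_world_size: 8   # assumed: number of GPUs on this machine
  nprocs_per_node: 8
  # distributed_port intentionally omitted for single-machine training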

stoneyang commented on September 10, 2024

Going by the suggestion provided here, this problem seems rather odd ....

chevalierNoir commented on September 10, 2024

Hi,

What distributed_world_size and nprocs_per_node did you set in the config?

stoneyang commented on September 10, 2024

Hi,

What distributed_world_size and nprocs_per_node did you set in the config?

In av_hubert/avhubert/conf/pretrain/base_lrs3_iter1.yaml, the two options are set as follows:

distributed_training:
  ddp_backend: no_c10d
  distributed_backend: 'nccl'
  distributed_world_size: 1
  distributed_port: 29671
  nprocs_per_node: 8

The only difference is that distributed_world_size was changed from 32 to 1.

chevalierNoir commented on September 10, 2024

distributed_world_size should be set to the number of GPUs to use, which should be 8 here.
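
If editing the yaml directly is inconvenient, the value can presumably also be overridden on the command line using the usual hydra override syntax (the config path below is a placeholder, and any task/data overrides your run needs still have to be supplied):

fairseq-hydra-train \
  --config-dir /path/to/av_hubert/avhubert/conf/pretrain \
  --config-name base_lrs3_iter1.yaml \
  distributed_training.distributed_world_size=8 \
  distributed_training.nprocs_per_node=8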
