Comments (5)
"distributed_world_size should be set to the number of GPUs to use, which should be 8 here."
Uh, I've tried that configuration before. When setting distributed_world_size to anything other than 1, the following error is raised:
[2022-02-16 14:33:42,664][fairseq.trainer][INFO] - detected shared parameter: feature_extractor_video.resnet.frontend3D.0.bias <- feature_extractor_video.resnet.trunk.layer4.1.conv2.bias
Traceback (most recent call last):
File "/usr/local/bin/fairseq-hydra-train", line 33, in <module>
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-hydra-train')())
File "/home/Code/av_hubert/fairseq/fairseq_cli/hydra_train.py", line 76, in cli_main
hydra_main()
File "/usr/local/lib/python3.8/site-packages/hydra/main.py", line 32, in decorated_main
_run_hydra(
File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 346, in _run_hydra
run_and_report(
File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
raise ex
File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
return func()
File "/usr/local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 347, in <lambda>
lambda: hydra.run(
File "/usr/local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 107, in run
return run_job(
File "/usr/local/lib/python3.8/site-packages/hydra/core/utils.py", line 129, in run_job
ret.return_value = task_function(task_cfg)
File "/home/Code/av_hubert/fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
distributed_utils.call_main(cfg, pre_main)
File "/home/Code/av_hubert/fairseq/fairseq/distributed/utils.py", line 369, in call_main
main(cfg, **kwargs)
File "/home/Code/av_hubert/fairseq/fairseq_cli/train.py", line 138, in main
trainer = Trainer(cfg, task, model, criterion, quantizer)
File "/home/Code/av_hubert/fairseq/fairseq/trainer.py", line 148, in __init__
if self.data_parallel_rank == 0:
File "/home/Code/av_hubert/fairseq/fairseq/trainer.py", line 181, in data_parallel_rank
return distributed_utils.get_data_parallel_rank()
File "/home/Code/av_hubert/fairseq/fairseq/distributed/utils.py", line 463, in get_data_parallel_rank
return get_rank(get_data_parallel_group())
File "/home/Code/av_hubert/fairseq/fairseq/distributed/utils.py", line 405, in get_rank
return dist.get_rank(group=group)
File "/usr/local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 688, in get_rank
default_pg = _get_default_group()
File "/usr/local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
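If I read the traceback correctly, dist.get_rank() is being called before the default process group exists. fairseq normally initializes that group in each worker before the Trainer is built, with something along these lines (a rough sketch, not fairseq's actual code; the port, world size, and rank values are illustrative):

import os
import torch.distributed as dist

# The default process group must exist before any rank query
# (dist.get_rank) or collective op can succeed.
dist.init_process_group(
    backend="nccl",                         # distributed_backend in the config
    init_method="tcp://localhost:29671",    # built from distributed_port (illustrative)
    world_size=8,                           # distributed_world_size
    rank=int(os.environ.get("RANK", "0")),  # this worker's rank (illustrative)
)
print(dist.get_rank())  # safe only after init_process_group has run

So the error suggests fairseq never reached this initialization path for the given configuration.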
After reading this, it turns out distributed_port should be removed in a single-machine setting. With that change, training now starts successfully on all GPUs!
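For reference, the distributed_training block that works for me now looks roughly like this (assuming a single machine with 8 GPUs; note that distributed_port is gone):

distributed_training:
  ddp_backend: no_c10d
  distributed_backend: 'nccl'
  distributed_world_size: 8
  nprocs_per_node: 8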
from av_hubert.
Given the suggestion provided here, this problem seems rather odd ...
from av_hubert.
Hi,
What distributed_world_size and nprocs_per_node did you set in the config?
from av_hubert.
"Hi, what distributed_world_size and nprocs_per_node did you set in the config?"
In av_hubert/avhubert/conf/pretrain/base_lrs3_iter1.yaml, the two options are set as follows:

distributed_training:
  ddp_backend: no_c10d
  distributed_backend: 'nccl'
  distributed_world_size: 1
  distributed_port: 29671
  nprocs_per_node: 8

The only difference is that distributed_world_size is changed from 32 to 1.
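In case it is useful, these options can also be overridden at launch time instead of editing the YAML, using standard Hydra override syntax (the config path is the one quoted above; the exact launch command may differ):

fairseq-hydra-train --config-dir av_hubert/avhubert/conf/pretrain \
  --config-name base_lrs3_iter1 \
  distributed_training.distributed_world_size=8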
from av_hubert.
distributed_world_size should be set to the number of GPUs to use, which should be 8 here.
from av_hubert.