
accelerate's Introduction




Run your *raw* PyTorch training script on any kind of device

Easy to integrate

πŸ€— Accelerate was created for PyTorch users who like to write the training loop of PyTorch models but are reluctant to write and maintain the boilerplate code needed to use multi-GPUs/TPU/fp16.

πŸ€— Accelerate abstracts exactly and only the boilerplate code related to multi-GPUs/TPU/fp16 and leaves the rest of your code unchanged.

Here is an example:

  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, data = accelerator.prepare(model, optimizer, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
          source = source.to(device)
          targets = targets.to(device)

          optimizer.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()

As you can see in this example, by adding 5 lines to any standard PyTorch training script you can now run on any kind of single or distributed node setting (single CPU, single GPU, multi-GPU, and TPU) as well as with or without mixed precision (fp8, fp16, bf16).

In particular, the same code can then be run without modification on your local machine for debugging or your training environment.

πŸ€— Accelerate even handles the device placement for you (which requires a few more changes to your code, but is safer in general), so you can even simplify your training loop further:

  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

- device = 'cpu'
+ accelerator = Accelerator()

- model = torch.nn.Transformer().to(device)
+ model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, data = accelerator.prepare(model, optimizer, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
-         source = source.to(device)
-         targets = targets.to(device)

          optimizer.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()

Want to learn more? Check out the documentation or have a look at our examples.

Launching script

πŸ€— Accelerate also provides an optional CLI tool that allows you to quickly configure and test your training environment before launching the scripts. No need to remember how to use torch.distributed.run or to write a specific launcher for TPU training! On your machine(s) just run:

accelerate config

and answer the questions asked. This will generate a config file that will be used automatically to properly set the default options when doing

accelerate launch my_script.py --args_to_my_script
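
For reference, here is a sketch of the kind of default_config.yaml this generates (the exact fields depend on your answers; this example assumes a single machine with two GPUs and fp16 enabled):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2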

For instance, here is how you would run the GLUE example on the MRPC task (from the root of the repo):

accelerate launch examples/nlp_example.py

This CLI tool is optional, and you can still use python my_script.py or torchrun my_script.py at your convenience.

You can also directly pass in the arguments you would use with torchrun as arguments to accelerate launch if you wish to skip running accelerate config.

For example, here is how to launch on two GPUs:

accelerate launch --multi_gpu --num_processes 2 examples/nlp_example.py

To learn more, check the CLI documentation available here.

Launching multi-CPU run using MPI

πŸ€— Here is another way to launch a multi-CPU run using MPI. You can learn how to install Open MPI on this page. You can use Intel MPI or MVAPICH as well. Once you have MPI set up on your cluster, just run:

accelerate config

Answer the questions that are asked, selecting to run using multi-CPU, and answer "yes" when asked if you want accelerate to launch mpirun. Then, use accelerate launch with your script like:

accelerate launch examples/nlp_example.py

Alternatively, you can use mpirun directly, without using the CLI like:

mpirun -np 2 python examples/nlp_example.py
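
For a multi-node run, a sketch assuming a standard Open MPI hostfile (the ./hostfile path and process count are placeholders):

mpirun -np 4 -hostfile ./hostfile python examples/nlp_example.py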

Launching training using DeepSpeed

πŸ€— Accelerate supports training on single/multiple GPUs using DeepSpeed. To use it, you don't need to change anything in your training code; you can set everything up with just accelerate config. However, if you want to tweak your DeepSpeed-related args from your Python script, we provide you with the DeepSpeedPlugin.

from accelerate import Accelerator, DeepSpeedPlugin

# deepspeed needs to know your gradient accumulation steps beforehand, so don't forget to pass it
# Remember you still need to do gradient accumulation by yourself, just like you would have done without deepspeed
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=2)
accelerator = Accelerator(mixed_precision='fp16', deepspeed_plugin=deepspeed_plugin)

# How to save your πŸ€— Transformer?
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(save_dir, save_function=accelerator.save, state_dict=accelerator.get_state_dict(model))

Note: DeepSpeed support is currently experimental. If you run into a problem, please open an issue.

Launching your training from a notebook

πŸ€— Accelerate also provides a notebook_launcher function you can use in a notebook to launch distributed training. This is especially useful for Colab or Kaggle notebooks with a TPU backend. Just define your training loop in a training_function and then, in your last cell, add:

from accelerate import notebook_launcher

notebook_launcher(training_function)
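
If your training_function takes arguments or you want to pin the number of processes, both can be passed in (a sketch; num_processes=8 assumes a v3-8 TPU, and the args tuple is illustrative):

notebook_launcher(training_function, args=(model, train_dataloader), num_processes=8)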

An example can be found in this notebook.

Why should I use πŸ€— Accelerate?

You should use πŸ€— Accelerate when you want to easily run your training scripts in a distributed environment without having to renounce full control over your training loop. This is not a high-level framework above PyTorch, just a thin wrapper so you don't have to learn a new library. In fact, the whole API of πŸ€— Accelerate is in one class, the Accelerator object.

Why shouldn't I use πŸ€— Accelerate?

You shouldn't use πŸ€— Accelerate if you don't want to write a training loop yourself. There are plenty of high-level libraries built on top of PyTorch that will offer you that; πŸ€— Accelerate is not one of them.

Frameworks using πŸ€— Accelerate

If you like the simplicity of πŸ€— Accelerate but would prefer a higher-level abstraction around its capabilities, some frameworks and libraries that are built on top of πŸ€— Accelerate are listed below:

  • Amphion is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
  • Animus is a minimalistic framework to run machine learning experiments. Animus highlights common "breakpoints" in ML experiments and provides a unified interface for them within IExperiment.
  • Catalyst is a PyTorch framework for Deep Learning Research and Development. It focuses on reproducibility, rapid experimentation, and codebase reuse so you can create something new rather than write yet another train loop. Catalyst provides a Runner to connect all parts of the experiment: hardware backend, data transformations, model training, and inference logic.
  • fastai is a PyTorch framework for Deep Learning that simplifies training fast and accurate neural nets using modern best practices. fastai provides a Learner to handle the training, fine-tuning, and inference of deep learning algorithms.
  • Finetuner is a service that enables models to create higher-quality embeddings for semantic search, visual similarity search, cross-modal text<->image search, recommendation systems, clustering, duplication detection, anomaly detection, or other uses.
  • InvokeAI is a creative engine for Stable Diffusion models, offering industry-leading WebUI, terminal usage support, and serves as the foundation for many commercial products.
  • Kornia is a differentiable library that allows classical computer vision to be integrated into deep learning models. Kornia provides a Trainer with the specific purpose to train and fine-tune the supported deep learning algorithms within the library.
  • Open Assistant is a chat-based assistant that understands tasks, can interact with third-party systems, and can retrieve information dynamically to do so.
  • pytorch-accelerated is a lightweight training library, with a streamlined feature set centered around a general-purpose Trainer, that places a huge emphasis on simplicity and transparency; enabling users to understand exactly what is going on under the hood, but without having to write and maintain the boilerplate themselves!
  • Stable Diffusion web UI is an open-source browser-based easy-to-use interface based on the Gradio library for Stable Diffusion.
  • torchkeras is a simple tool for training PyTorch models in a Keras style; a dynamic, attractive plot is provided in notebooks to monitor your loss or metrics.
  • transformers is a library for training state-of-the-art machine learning models in PyTorch, TensorFlow, and JAX. (Accelerate is the backend for the PyTorch side.)

Installation

This repository is tested on Python 3.8+ and PyTorch 1.10.0+.

You should install πŸ€— Accelerate in a virtual environment. If you're unfamiliar with Python virtual environments, check out the user guide.

First, create a virtual environment with the version of Python you're going to use and activate it.

Then, you will need to install PyTorch: refer to the official installation page regarding the specific install command for your platform. Then πŸ€— Accelerate can be installed using pip as follows:

pip install accelerate

Supported integrations

  • CPU only
  • multi-CPU on one node (machine)
  • multi-CPU on several nodes (machines)
  • single GPU
  • multi-GPU on one node (machine)
  • multi-GPU on several nodes (machines)
  • TPU
  • FP16/BFloat16 mixed precision
  • FP8 mixed precision with Transformer Engine
  • DeepSpeed support (Experimental)
  • PyTorch Fully Sharded Data Parallel (FSDP) support (Experimental)
  • Megatron-LM support (Experimental)

Citing πŸ€— Accelerate

If you use πŸ€— Accelerate in your publication, please cite it by using the following BibTeX entry.

@Misc{accelerate,
  title =        {Accelerate: Training and inference at scale made simple, efficient and adaptable.},
  author =       {Sylvain Gugger and Lysandre Debut and Thomas Wolf and Philipp Schmid and Zachary Mueller and Sourab Mangrulkar and Marc Sun and Benjamin Bossan},
  howpublished = {\url{https://github.com/huggingface/accelerate}},
  year =         {2022}
}

accelerate's People

Contributors

abhilash1910, akx, benjaminbossan, chris-hughes10, dberenbaum, faaany, fxmarty, liamswayne, lysandrejik, mishig25, muellerzr, pacman100, patrickvonplaten, pcuenca, philschmid, ranchlai, ryanrussell, sgugger, stas00, statelesshz, stevhliu, sumanthrh, sunmarc, sywangyi, thomwolf, will-cromar, xloem, younesbelkada, yuxinyuan, zhiyuanchen


accelerate's Issues

Multi machine training not working

I am trying to run my training code on 2 machines, each with 2 GPUs. However, the programs seem to run separately and do not speed up training. Here is my config.yaml

machine 1:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 0
main_process_ip: 192.168.0.1
main_process_port: 99999
main_training_function: main
num_machines: 2
num_processes: 4

machine 2:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 1
main_process_ip: 192.168.0.1
main_process_port: 99999
main_training_function: main
num_machines: 2
num_processes: 4

how to turn off accelerate?

When I want to run my program without accelerate, I just type "python my_file" instead of "accelerate launch my_file".
However, my program got stuck.
Could you help me? Thanks!

Address already in use

When running two programs from the same Python file, I encountered the following issue. I think it is caused by the address/port, but I do not know how to change it.

Traceback (most recent call last):
  File "run_clm_no_trainer.py", line 490, in <module>
    main()
  File "run_clm_no_trainer.py", line 207, in main
    accelerator = Accelerator()
  File "/home/.conda/envs/transformers-sgd/lib/python3.7/site-packages/accelerate/accelerator.py", line 79, in __init__
    self.state = AcceleratorState(fp16=fp16, cpu=cpu, _from_accelerator=True)
  File "/home/.conda/envs/transformers-sgd/lib/python3.7/site-packages/accelerate/state.py", line 125, in __init__
    torch.distributed.init_process_group(backend="nccl")
  File "/home/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/.local/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
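
One way to avoid the collision is to give each run its own rendezvous port, either via main_process_port in separate config files or on the command line (a sketch; the port numbers are arbitrary free ports):

accelerate launch --main_process_port 29501 run_clm_no_trainer.py
accelerate launch --main_process_port 29502 run_clm_no_trainer.py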

Error in running multi GPU model

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:761, internal error, NCCL version 2.7.8
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

Training Time For Multinode is similar to training time for single node

@sgugger As per our discussion on #37, here are my launch method and results.
I am using the AML per-process launcher shown here (option 1):
https://azure.github.io/azureml-cheatsheets/ja/docs/cheatsheets/python/v1/distributed-training/
to launch a distributed training job with multiple processes per node.

I am using the traditional run_mlm.py (without Trainer) from the transformers repository here.

Comparing the tqdm logs for a 1-node run vs. an 8-node run, I observe the following:

8 Node log :

INFO:main:***** Running training *****
INFO:main: Num examples = 4842767
INFO:main: Num Epochs = 10
INFO:main: Instantaneous batch size per device = 64
INFO:main: Total train batch size (w. parallel, distributed & accumulation) = 2048
INFO:main: Gradient Accumulation steps = 1
INFO:main: Total optimization steps = 23650

0%| | 20/23650 [01:20<26:05:59, 3.98s/it]
0%| | 21/23650 [01:24<26:04:43, 3.97s/it]
0%| | 22/23650 [01:28<26:12:45, 3.99s/it]

1Node Log:
INFO:main:***** Running training *****
INFO:main: Num examples = 4842767
INFO:main: Num Epochs = 10
INFO:main: Instantaneous batch size per device = 64
INFO:main: Total train batch size (w. parallel, distributed & accumulation) = 256
INFO:main: Gradient Accumulation steps = 1
INFO:main: Total optimization steps = 189180

0%| | 7/189180 [00:04<27:49:11, 1.89it/s]
0%| | 8/189180 [00:04<27:27:56, 1.91it/s]

As you can see, the per-step iteration time for 8 nodes is about 8 times higher than for a single node, so the estimated total training time for one node is roughly the same as for 8 nodes.

Can't send the values of int to device

My training data looks like:

src_image, target_image, src_camera, target_camera, src_camera_idx, target_camera_idx

Where src_camera_idx, target_camera_idx are integers

When I try to apply accelerate I get the following error:
TypeError: Can't send the values of type <class 'int'> to device cuda:0, only of nested list/tuple/dicts of tensors or objects having a to method.

We don't need to send the integers to the device. Perhaps, instead of raising an error here, you could simply skip items that cannot be moved to the device? Or at least give me the option to skip them if I know my data contains such objects.

Typo in examples README

Happened to find a typo in the examples README.
Lines 60 and 148 use the option fb16, which I believe should be fp16.

Thank you for this new library and launch tool!

learning rate decay

I want to know: if I use learning rate decay like this,

optimizer = accelerator.prepare(optimizer)
current_lr = learning_rate * decay_factor
for group in optimizer.param_groups:
        group['lr'] = current_lr

Should I do this for the main process or all processes?
Thanks

How to save models with Accelerator.save in DDP mode

Hi,

My config file is

{
  "compute_environment": "LOCAL_MACHINE",
  "distributed_type": "MULTI_GPU",
  "fp16": false,
  "machine_rank": 0,
  "main_process_ip": null,
  "main_process_port": null,
  "main_training_function": "main",
  "num_machines": 1,
  "num_processes": 2
}

When I use Accelerator.save(unwrapped_model.state_dict(), path), the model is saved twice (because I am using two GPUs).

In the PyTorch DDP example, they save the model only when the rank is 0, which avoids saving the model multiple times. How can I do that with accelerate?

Thanks!
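
A sketch of one way to mirror that rank-0 check with Accelerate's own helpers (assuming model and path are defined as above):

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
# Only the main process writes the checkpoint, so it is saved exactly once
if accelerator.is_main_process:
    accelerator.save(unwrapped_model.state_dict(), path)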

Mismatch between `accelerate config` cli and `default_config.yaml`

The generated default_config.yaml does not match the answers given to accelerate config (note that the port value ends up in main_process_ip, while main_process_port is null).

Here are my CLI outputs and default_config.yaml.

cli outputs

In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-GPU, [2] TPU): 1
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
What is the rank of this machine (from 0 to the number of machines - 1 )? [0]: 1
What is the IP address of the machine that will host the main process? 10.29.150.50
What is the port you will use to communicate with the main process? 2333
How many processes in total will you use? [1]: 6
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes

default_config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 1
main_process_ip: 2333
main_process_port: null
main_training_function: main
num_machines: 2
num_processes: 6

Accelerator not recognizing TPU in Google Colab and Kaggle Kernels

I installed and imported accelerate in both Kaggle Kernels and Google Colab with TPU turned on but it doesn't seem to detect the TPU and instead detects CPU when running the following code:

$ pip install -q accelerate
import accelerate
acc = accelerate.Accelerator()
device = acc.device
print(device)

The above snippet just outputs cpu when run on both of the aforementioned platforms with TPU enabled.

Is there something that I am doing wrong?

PyTorch version: 1.7.0
Python version: 3.7.9

Distributed on multi-CPU

Just wondering if there is a way to distribute over CPUs (single node or multiple nodes).
It would be a very useful feature for some sparse models.

Different performance when training with single GPU vs. multiple GPUs

I'm currently using accelerate to fine-tune a huggingface pretrained Transformer with some additional classification heads, and I'm finding that performance when using multiple GPUs is much worse than with a single GPU, even when using the same batch size, learning rate, and number of training steps/epochs. I'm using accelerate to parallelize the training loop over multiple GPUs, but the validation/test set evaluation is a custom function that isn't easily adapted to use with accelerate so I'm doing that part on a single GPU in the main process. To run the entire script on a single GPU vs. multiple GPUs, I just adjust the --num_processes argument for accelerate launch as well as the batch size to match, for example:

accelerate launch --num_processes 1 <script> --batch_size 32 (for 1 GPU)

accelerate launch --num_processes 4 <script> --batch_size 8 (for 4 GPUs)

The multi-GPU training seems to be running fine, in the sense that running nvidia-smi shows all 4 GPUs being fully utilized and the training data loader is the correct length for the given batch size (same length for both of the commands above), but there's still a drop in performance in the multi-GPU case. When printing output on each GPU, the processes do seem to be waiting for the main process to finish running the evaluation function as expected. This also doesn't seem to just be an issue of running single-GPU evaluation within a multi-GPU training loop, since loading the saved model weights after training and re-running the evaluation on a single GPU gives the same performance.

Any help is appreciated, thanks!

Pseudocode:

# set up accelerator
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

# set up device (for evaluation)
device = accelerator.device

model = ... # initialize model
optimizer = ... # initialize optimizer
loader = ... # initialize training data loader

valid_dataset = ... # initialize validation dataset
test_dataset = ... # initialize test dataset

# prepare model, optimizer, data loader
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

# training loop
for epoch in range(epochs):
    model.train()
    for inputs, targets in loader:
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        optimizer.zero_grad()
        accelerator.backward(loss)
        optimizer.step()

    # evaluate on validation set with unwrapped model in main process (single-GPU)
    if accelerator.is_main_process:
        unwrapped_model = accelerator.unwrap_model(model).to(device)
        unwrapped_model.eval()
        metrics = calculate_metrics(unwrapped_model, valid_dataset, device)
        print(metrics)

# evaluate on test set with unwrapped model in main process (single-GPU)
if accelerator.is_main_process:
    unwrapped_model = accelerator.unwrap_model(model).to(device)
    unwrapped_model.eval()
    metrics = calculate_metrics(unwrapped_model, test_dataset, device)
    print(metrics)

Version information:

  • torch: 1.6.0
  • transformers: 3.3.1
  • accelerate: 0.3.0
  • CUDA: 10.1

Unable to send extra params to DDP

My model's forward function returns losses as well as some debug output, so I need to set find_unused_parameters=True in DDP, but there's no way to pass this in during preparation.

Perhaps we could pass some kwargs for the items being prepared? I'm not sure how generic this problem is.
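
For reference, newer versions of Accelerate expose this through kwargs handlers (a sketch; the same pattern appears in the pseudocode of the multi-GPU performance issue above):

from accelerate import Accelerator, DistributedDataParallelKwargs

# Forward DDP-specific keyword arguments through a kwargs handler
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])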

Deepspeed

Hi, I was just wondering if there were any future plans to to integrate deepspeed or equivalent functionality as a backend (like the Transformers library does)?

Training after several epochs it throws cudaErrorLaunchFailure: unspecified launch failure

When I use accelerate with my model, it often throws this error after training for several epochs. My code is as follows:

    def train(self, train_step, compute_metric, eval_step, train_scratch=False):
        wandb.init(project=self.model._get_name(), resume=~train_scratch)
        self.logger.info("Start training model")
        wandb.watch(self.model, log="all")
        wandb.save("model.py")
        set_seed(self.config.seed)
        self.model.train()
        accelerator = Accelerator(fp16=True)
        start_epoch = 1
        data_loader = self.get_train_dataset_dataloader()
        optimizer = self.optimizer(self.model.parameters(), lr=self.config.lr)
        schedule = get_cosine_schedule_with_warmup(optimizer,
                                                   num_warmup_steps=1000,
                                                   num_training_steps=len(data_loader) * self.config.epochs)
        if train_scratch:
            model, optimizer, data_loader = accelerator.prepare(self.model, optimizer, data_loader)
            start_epoch = 1
            best_metric = 0.
        elif os.path.exists(self.output_dir) and os.listdir(self.output_dir):
            checkpoint = self.load_checkpoint()
            self.model.load_state_dict(checkpoint['net'])
            optimizer.load_state_dict(checkpoint["optimizer"])
            schedule.load_state_dict(checkpoint['scheduler'])
            start_epoch = checkpoint['epoch'] + 1 if checkpoint['epoch'] > 1 else 1
            best_metric = checkpoint["best_metric"]
            model, optimizer, data_loader = accelerator.prepare(self.model, optimizer, data_loader)

        last_step = len(data_loader) - 1
        train_loss, train_score, log_info = 0., {}, {}
        eval_loss, eval_score = 0., {}
        for epoch in range(start_epoch, self.config.epochs + 1):
            # do train
            model.train()
            for i, data in enumerate(tqdm(data_loader, desc=f"Epoch {epoch}/{self.config.epochs}"), start=1):
                torch.cuda.empty_cache()
                outputs = train_step(model=model, data=data)
                loss = outputs.loss / self.config.accumulate_step
                train_loss += loss.item()
                accelerator.backward(loss)
                if i % self.config.accumulate_step == 0 or i == last_step:
                    optimizer.step()
                    schedule.step()
                    optimizer.zero_grad()
                    wandb.log({"lr": schedule.get_last_lr()[-1]},
                              step=math.ceil(i / self.config.accumulate_step) + math.ceil(
                                  (epoch - 1) * len(data_loader) / self.config.accumulate_step))
                compute_metric(outputs.logits, data['labels'], self.metrics)
            train_loss = train_loss / len(data_loader)
            for metric in self.metrics:
                train_score.update(metric.compute())
            log_info.update({"train_acc": train_score["accuracy"], "train_loss": train_loss})
            self.logger.info("Epoch {}/{} train_loss={:.5f}\t train_accuracy={:.5f}".format(epoch, self.config.epochs,
                                                                                            loss,
                                                                                            train_score['accuracy']))
            # do eval
            if self.eval_dataset:
                model.eval()
                eval_loader = accelerator.prepare(self.get_val_dataset_dataloader())
                for i, data in enumerate(tqdm(eval_loader, desc=f"Eval Epoch {epoch}/{self.config.epochs}"), start=1):
                    outputs = eval_step(model, data)
                    eval_loss += outputs.loss.item()
                    compute_metric(outputs.logits, data['labels'], self.metrics)
                eval_loss = eval_loss / len(eval_loader)
                for metric in self.metrics:
                    eval_score.update(metric.compute())
                log_info.update({"eval_loss": eval_loss, "eval_acc": eval_score["accuracy"]})
                self.logger.info(
                    "Eval Epoch {}/{} eval_loss={:.5f}\t eval_accuracy={:.5f}".format(epoch, self.config.epochs,
                                                                                      eval_loss,
                                                                                      eval_score['accuracy']))

And the error is as follows:

Traceback (most recent call last):
  File "D:/code/GNN_LM1/train.py", line 43, in <module>
    trainer.train(train_step, msm_compute_metric, eval_step, train_scratch=True)
  File "D:\code\GNN_LM1\trainer.py", line 138, in train
    train_loss += loss.item()
  File "C:\Anaconda3\lib\site-packages\accelerate\accelerator.py", line 249, in backward
    self.scaler.scale(loss).backward()
  File "C:\Anaconda3\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Anaconda3\lib\site-packages\torch\autograd\__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure

PyTorch version: 1.8
I want to know what causes this problem and how to fix it.

Multi GPU Training Not Working

While using Accelerate, only 1 of the 2 GPUs present is being utilized. I am training following the general instructions in the repository. The architecture is an AutoEncoder.

dataloader = DataLoader(dataset, batch_size = 2048, shuffle=True, pin_memory=False, num_workers=20)
encoder = Encoder(bottleneck_size = 2, embedding_size = 40, vocab = dataset.vocab).to(device)
decoder = Decoder(bottleneck_size = 2, embedding_size = 40, vocab = dataset.vocab).to(device)
model = AutoEncoder(encoder, decoder).to(device)
loss = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

I am transferring the samples in the batch to the device using the code below:

    for x in batch:
        batch[x] = batch[x].to(device)

The device is being determined by using:

device = accelerator.device

Both devices are visible which can be confirmed by using torch.cuda.device_count() which returns 2.

Devices are RTX 2080 with CUDA Version 11.2. Driver version is 460.67.
The distro is Pop!_OS.

Outside of venv?

I'm interested in using this on a jupyterhub machine, which is a single node. Do you expect that to have any issues? I can only assume the recommendation to use a virtual environment is for package compatibility, which I'm comfortable with sorting through in the base jupyterhub environment.

DDP how to evaluate with custom metrics

Hi there,

Is there a way to evaluate the entire eval dataset using custom metrics instead of datasets.Metric?
I'm using code similar to this.
I'm fine-tuning a T5 on multitask learning, so I can't use the metric directly, because different prefixes are associated with different evaluation metrics.

Please advise
Thanks for your help!

Variables or Operations to identify rank at the begining of the program

I need to identify the main or first process in order to do something (for example, logging) at the beginning of the program, but I can't find any variable that makes this possible.

Only after some module/dataloader/optimizer has been prepared by accelerator.prepare() can torch.distributed.get_rank() be used for this.

Are there any other variables or operations (like an empty prepare() function) that can give me a flag to distinguish these processes at program start?

Any suggestions?
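
A sketch of one approach (assuming a reasonably recent version of the library): constructing the Accelerator initializes the distributed state immediately, so its attributes are readable before anything is prepared.

from accelerate import Accelerator

accelerator = Accelerator()
# These are set as soon as the Accelerator is created, before prepare()
if accelerator.is_main_process:
    print(f"rank {accelerator.process_index} of {accelerator.num_processes} is the main process")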

Config class(decorated by dataclass) can not receive kwargs from the config dict.

First, thanks for your excellent work. But it seems that I have come across a very strange problem. After finishing "accelerate config", I launched my script using "accelerate launch my_script.py" and got the following error:

Traceback (most recent call last):
  File "/home/wuyongfa/anaconda3/envs/mmdet/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/wuyongfa/anaconda3/envs/mmdet/lib/python3.7/site-packages/accelerate/commands/accelerate_cli.py", line 41, in main
    args.func(args)
  File "/home/wuyongfa/anaconda3/envs/mmdet/lib/python3.7/site-packages/accelerate/commands/launch.py", line 297, in launch_command
    defaults = load_config_from_file(args.config_file)
  File "/home/wuyongfa/anaconda3/envs/mmdet/lib/python3.7/site-packages/accelerate/commands/config/config_args.py", line 61, in load_config_from_file
    return config_class.from_yaml_file(yaml_file=config_file)
  File "/home/wuyongfa/anaconda3/envs/mmdet/lib/python3.7/site-packages/accelerate/commands/config/config_args.py", line 100, in from_yaml_file
    return cls(**config_dict)
TypeError: __init__() got an unexpected keyword argument 'machine_rank'

I assume this is because the BaseConfig class (decorated by dataclass) cannot receive kwargs from the config dict. Could anyone help me find out why?

Access Global Variables Inside of training_function()

I am trying to use Accelerate to do large-scale model inference. In particular, I am using T5 to transform strings into a different format on Google's Colab TPUs.

This has been working fine, as I can print my outputs and verify they are correct. However, when I try to store the model outputs I seem to be unable to do so. The global variables do not seem to be recognized once I run the notebook launcher and I can't return anything from the function. Any advice on how to do this?

Here is some pseudocode of what I would like to happen

outputs = []
def training_function():
    global outputs
    model = T5ForConditionalGeneration.from_pretrained('t5-base')
    model, dataloader = accelerator.prepare(
        model, dataloader
    )
    for batch in dataloader:
        attention_mask, input_ids = batch['attention_mask'], batch['input_ids']
        output = model.generate(input_ids=input_ids, attention_mask=attention_mask)
        outputs.append(output)

notebook_launcher(training_function)
print(outputs)

Thanks!

Torch Geometric compatibility

Hi,

Awesome package, I'm really liking how easy it is to plug-and-play in my training scripts.

Would it be possible to have compatibility with PyTorch Geometric (graph neural networks, etc.)? Torch Geometric uses a custom collate function in its DataLoader to deal with graph-like data, so right now putting it into the Accelerator gives this error upon iterating:

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'torch_geometric.data.data.Data'>

Here is the relevant code: https://github.com/rusty1s/pytorch_geometric/blob/480f9d59d6d18166a5da2e2519fa9a6b33d3d4ad/torch_geometric/data/dataloader.py#L8-L70

Thanks!
Miles

Feature request: Add support for NamedTuples in dataloaders

In order to produce self-documenting code, our team has the habit of using NamedTuples instead of plain tuples as the return type of our datasets.

Since they are subclasses of tuple, every PyTorch mechanism we've encountered handles them as if they were plain tuples, so it all works smoothly.

When using Accelerate, I get an error in send_to_device at line 114. type(tensor)(send_to_device(t, device) for t in tensor) raises a type error because type(tensor) returns my NamedTuple, which cannot take the generator (send_to_device(t, device) for t in tensor) as an argument. We would need some way to transform the generator into a form the NamedTuple constructor accepts.
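
A sketch of the kind of special-casing that would fix this (the move helper is hypothetical, not the library's actual send_to_device):

def move(obj, device):
    # NamedTuples must be rebuilt from positional arguments,
    # so unpack the generator instead of passing it in directly
    if isinstance(obj, tuple) and hasattr(obj, "_fields"):
        return type(obj)(*(move(o, device) for o in obj))
    if isinstance(obj, (list, tuple)):
        return type(obj)(move(o, device) for o in obj)
    return obj.to(device) if hasattr(obj, "to") else obj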

Loading a checkpoint saved using accelerator.save in Multi-GPU setting

Hi

This might be a noob question but I couldn't figure out a way to load checkpoints that were saved using accelerator.save. If I use torch.load to load the model state_dict in a Multi-GPU setting, it loads it multiple times on the first GPU, which leads to OOM.

config = T5Config().from_pretrained('t5-small')
model = T5ForConditionalGeneration(config)
checkpoint = torch.load(checkpoint_location)
model.load_state_dict(checkpoint)

I am however able to load checkpoints using model.from_pretrained and it works in multi GPU setting

model = T5ForConditionalGeneration.from_pretrained('t5-small')

This does not solve my problem, since I need to load models saved using accelerator.save.

Any help would be appreciated!
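
A sketch of one common fix, reusing checkpoint_location from above and assuming an Accelerator is already set up: pass map_location so each process deserializes onto CPU instead of defaulting to GPU 0, then move the model afterwards.

import torch

# Deserialize on CPU so each process avoids allocating on GPU 0
checkpoint = torch.load(checkpoint_location, map_location="cpu")
model.load_state_dict(checkpoint)
model.to(accelerator.device)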

Multi-GPU CLI issue

Hi- Thanks for the great library, Sylvain!

The config file looks as follows:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2

The relevant part of the code is as follows:

    accelerator = Accelerator(fp16=config['fp16'], cpu=config['cpu'])
    print(accelerator.device)

    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
    lr = config["lr"]
    num_epochs = int(config["num_epochs"])
    seed = int(config["seed"])
    batch_size = int(config["batch_size"])

    # If the batch size is too big we use gradient accumulation
    gradient_accumulation_steps = 1
    if batch_size > MAX_GPU_BATCH_SIZE:
        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
        batch_size = MAX_GPU_BATCH_SIZE

    # Instantiate dataloaders.
    train_dataloader = DataLoader(
        train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size
    )
    valid_dataloader = DataLoader(
        validation_dataset, shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
    )
    test_dataloader = DataLoader(
        test_dataset, shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
    )

    # Instantiate the model (we build the model here so that the seed also control new weights initialization)
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")


    # Instantiate optimizer
    optimizer = AdamW(params=model.parameters(), lr=lr)

    # Prepare everything
    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
    # prepare method.
    prepared = accelerator.prepare(
        model, optimizer, train_dataloader, valid_dataloader, test_dataloader
    )
    model, optimizer, train_dataloader, valid_dataloader, test_dataloader = prepared


    # Now we train the model
    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(train_dataloader):
            # We could avoid this line since we set the accelerator with `device_placement=True`.
            #batch.to(accelerator.device)
            outputs = model(**batch)
            loss = outputs.loss
            loss = loss / gradient_accumulation_steps
            accelerator.backward(loss)
            if step % gradient_accumulation_steps == 0:
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()

The script utilizes a single GPU, though there are 2 GPUs.

>>> torch.cuda.device_count()
2

Launching the script from the command line:

accelerate launch training.py

The print statement print(accelerator.device) returns following (happy to add more debugging)

cuda

Any help is appreciated. Thank you!

question about amp

Hello. I'm very excited to be using this library.

I have a question: can it be used with torch.amp?

I want to use both libraries together for training!

Thanks!

accelerator.prepare fails with IterableDataset

I am trying to use the accelerate library in my example script, but it fails when used with an IterableDataset. Here is the error message:

Traceback (most recent call last):
  File "run_pretrain.py", line 612, in <module>
    main()
  File "run_pretrain.py", line 466, in main
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
  File "/data/nfs_home/ddkalamk/bert/accelerate/src/accelerate/accelerator.py", line 201, in prepare
    result = tuple(self._prepare_one(obj) for obj in args)
  File "/data/nfs_home/ddkalamk/bert/accelerate/src/accelerate/accelerator.py", line 201, in <genexpr>
    result = tuple(self._prepare_one(obj) for obj in args)
  File "/data/nfs_home/ddkalamk/bert/accelerate/src/accelerate/accelerator.py", line 159, in _prepare_one
    return self.prepare_data_loader(obj)
  File "/data/nfs_home/ddkalamk/bert/accelerate/src/accelerate/accelerator.py", line 231, in prepare_data_loader
    return prepare_data_loader(
  File "/data/nfs_home/ddkalamk/bert/accelerate/src/accelerate/data_loader.py", line 416, in prepare_data_loader
    return DataLoaderShard(
  File "/data/nfs_home/ddkalamk/bert/accelerate/src/accelerate/data_loader.py", line 280, in __init__
    super().__init__(dataset, **kwargs)
  File "/nfs_home/ddkalamk/anaconda3/envs/bert/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 194, in __init__
    raise ValueError(
ValueError: DataLoader with IterableDataset: expected unspecified batch_sampler option, but got batch_sampler=<torch.utils.data.sampler.BatchSampler object at 0x2ba0839164f0>
srun: error: pcl-skx35: task 0: Exited with exit code 1

I am using PyTorch v1.6.0, but it seems to have the same issue even with the latest PyTorch.
Is there anything I am missing?

Error in GAN Training Code

Code:

import torch
import torch.nn as nn
import torch.optim as optim 
import numpy as np
from accelerate import Accelerator


def bce_false(x):
    bce = nn.BCEWithLogitsLoss(reduction='none')
    target = torch.zeros(x.size()).cuda()
    return bce(x, target)


def bce_true(x):
    bce = nn.BCEWithLogitsLoss(reduction='none')
    target = torch.ones(x.size()).cuda()
    return bce(x, target)

accelerator = Accelerator()


class Discriminator(nn.Module):

    def __init__(self, in_dim=1, image_size=128, conv_dim=64, c_dim=512, repeat_num=6):
        super(Discriminator, self).__init__()                
            
        layers = []                    
        
        layers.append(
           nn.Sequential(
                nn.Conv2d(in_dim, conv_dim, kernel_size=4, stride=2, padding=1),                
                nn.BatchNorm2d(conv_dim, affine=True, track_running_stats=True),
                nn.LeakyReLU(inplace=True)) 
        )
        
        curr_dim = conv_dim
        for i in range(1, repeat_num):
            layer = nn.Sequential(
                nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1),                
                nn.BatchNorm2d(curr_dim*2, affine=True, track_running_stats=True),
                nn.LeakyReLU(inplace=True))
                                 
            layers.append(layer)                   
            
            curr_dim = curr_dim * 2
        self.down = nn.ModuleList(layers)

        kernel_size = int(image_size / np.power(2, repeat_num))
        self.conv1 = nn.Conv2d(curr_dim, 1, kernel_size=3, stride=1, padding=1, bias=False)        

    def forward(self, x):
        
        (b, t, c, h, w) = x.size()
        x = x.view(b * t, -1, h, w)        
        
        for layer in self.down:
            x = layer(x)                                         
        out_src = self.conv1(x)
        
        return out_src  


D = Discriminator(image_size=96).cuda()


lr = 1e-4
optim_d = optim.Adam(D.parameters(), lr = lr, weight_decay=1e-4)  
optim_d, D = accelerator.prepare(optim_d, D)


torch.autograd.set_detect_anomaly(True)
video = torch.zeros(12, 29, 1, 96, 96).cuda()
loss_d = 0.0
loss_d = bce_true(D(video.clone())).reshape(-1).mean()
loss_d = loss_d + bce_false(D(video.clone())).reshape(-1).mean()
optim_d.zero_grad()
accelerator.backward(loss_d)
optim_d.step()

Error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

This code runs correctly in PyTorch single-GPU mode, so might it be caused by Accelerate?

accelerator.gather() for non-tensor inputs?

Can I use gather() to merge two non-tensor inputs? For example:
One process has a list of string like [ ['a', 'aa'], ['b', 'bb'] ], another has a list like [ ['c', 'cc'], ['d', 'dd'] ].
Is there any way to merge these two lists into [ ['a', 'aa'], ['b', 'bb'], ['c', 'cc'], ['d', 'dd'] ]?
I think it's necessary when evaluating the model; gathering only the logits and labels may not always be enough.
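
accelerator.gather itself only handles (nested containers of) tensors, but here is a sketch with plain torch.distributed, assuming PyTorch 1.8+ and an initialized process group, where my_strings is the per-process list:

import torch.distributed as dist

# Gather an arbitrary picklable object from every process
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, my_strings)
# gathered holds one entry per process; flatten into a single list
merged = [item for part in gathered for item in part]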

Regarding the problem that the accelerate library does not work across multiple nodes

I want to complete a task on 2 nodes with 4 GPUs.
I configured the config file as required (on both nodes):

In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-GPU, [2] TPU): 1
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
What is the rank of this machine (from 0 to the number of machines - 1 )? [0]: 0
What is the IP address of the machine that will host the main process? same IP
What is the port you will use to communicate with the main process? same port
How many processes in total will you use? [1]: 2
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes

But when I run accelerate launch train.py on these two nodes, they do not cooperate; they each complete the training task independently.
I don't know what to do.

In addition, is this related to the absolute paths I use for (--data_dir --model --output_dir)?

`accelerate test` ignores --config_file

Hi,
Great package, it helped me a lot today! So far it is as simple as it seems πŸŽ‰

I noticed that accelerate test --config_file accelerate_config.yml uses default config values instead of the values from accelerate_config.yml. To test this, create an accelerate_config.yml file with contents different from your main config. For example, say the default config has num_processes=3, but you only want to use 2 GPUs, and create a config like this:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2

After running accelerate test --config_file accelerate_config.yml you will see something like this:

Distributed environment: MULTI_GPU
Num processes: 3
Process index: 1
Local process index: 1
Device: cuda:1
Use FP16 precision: False

Distributed environment: MULTI_GPU
Num processes: 3
Process index: 2
Local process index: 2
Device: cuda:2
Use FP16 precision: False

**Initialization**
Testing, testing. 1, 2, 3.
Distributed environment: MULTI_GPU
Num processes: 3
Process index: 0
Local process index: 0
Device: cuda:0
Use FP16 precision: False

Three GPUs are used instead of the specified 2.

This happens because accelerate-launch requires all keyword arguments to precede the training script path, but accelerate-test does this

cmd = ["accelerate-launch"] + test_args

so what happens is that --config_file is recognized as a training script argument instead of a launch script argument. You can see this if you print out the args of the corresponding accelerate-launch call.

Sending a PR with a fix soon.

UnboundLocalError when running on Google Colab using TPU runtime

Steps to reproduce

  1. Open new Google Colab notebook and choose TPU runtime.

  2. Install accelerate

    !pip install accelerate
    
  3. Run accelerate config

    In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
    Which type of machine are you using? ([0] No distributed training, [1] multi-GPU, [2] TPU): 2
    What is the name of the function in your script that should be launched in all parallel scripts? [main]: 
    Traceback (most recent call last):
      File "/usr/local/bin/accelerate", line 8, in <module>
        sys.exit(main())
      File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 41, in main
        args.func(args)
      File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/config/__init__.py", line 64, in config_command
        config = get_user_input()
      File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/config/__init__.py", line 37, in get_user_input
        config = get_cluster_input()
      File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/config/cluster.py", line 81, in get_cluster_input
        num_processes=num_processes,
    UnboundLocalError: local variable 'num_processes' referenced before assignment
    

[QUESTION] Why do we reinit `AcceleratorState` everytime we prepare an object again?

Hey @sgugger !

Thanks for the clean & concise code. I love it!

Could I please ask what's the idea behind initializing state=AcceleratorState() inside prepare_data_loader here again?

In which case would the condition if num_processes is None: or if process_index is None: be True? As I understand it, inside the Accelerator we have already set state=AcceleratorState(), which sets the variables num_processes, process_index, etc. based on the config.

So why do we need to initialize AcceleratorState again?

nlp example doesn't run faster with multi-gpu

I am running the example on 2080 Ti cards, where each epoch takes 21 seconds with 1 GPU. When using 2 GPUs, it also takes 21 seconds. (I used tqdm to measure the time.)
Everything looks right: GPU utilization is ~100% on both GPUs and the number of batches per device is halved, but it's not faster.
I tried the cv example, and using 2 GPUs does speed up the training.

Is this normal? What might be the cause of this?

How to use on CPU?

Hey guys, how do I use the accelerator on CPU?

acc = Accelerator(cpu=True)
print(acc.device)

The output is cuda.

Thank you!

Cheers,

Francesco

Expected to have finished reduction in the prior iteration before starting a new one.

I have modified the nlp_example to finetune an EncoderDecoder on translation data like this:

accelerator = Accelerator(device_placement=False, fp16=args.fp16, cpu=args.cpu)
def _tokenize(batch):
    if accelerator.distributed_type == DistributedType.TPU:
        src = tokenizer(batch[0], padding="max_length", max_length=128, return_tensors="pt")
        tgt = tokenizer(batch[1], padding="max_length", max_length=128, return_tensors="pt")
    else:
        src = tokenizer(list(batch[0]), padding="longest", return_tensors="pt")
        tgt = tokenizer(list(batch[1]), padding="longest", return_tensors="pt")
    return src, tgt
...
for step, batch in train_bar:
    src, tgt = _tokenize(batch)
    src["input_ids"] = src["input_ids"].to(accelerator.device)
    tgt["input_ids"] = tgt["input_ids"].to(accelerator.device)
    outputs = model(input_ids=src["input_ids"], decoder_input_ids=tgt["input_ids"], labels=tgt["input_ids"])
    loss = outputs.loss
    loss = loss / gradient_accumulation_steps
    accelerator.backward(loss)
    if step % gradient_accumulation_steps == 0:
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    if step % eval_steps == 0:
        model.eval()
        for step, batch in enumerate(dev_dataloader):
            src, tgt = _tokenize(batch)
            src["input_ids"] = src["input_ids"].to(accelerator.device)
            tgt["input_ids"] = tgt["input_ids"].to(accelerator.device)
            with torch.no_grad():
                predictions = model.generate(
                    src["input_ids"],
                    decoder_start_token_id=tokenizer.convert_tokens_to_ids("[CLS]"),
                    num_beams=4,
                    repetition_penalty=1.0,
                    do_sample=False,
                    forced_bos_token_id=None,
                )
            pred_str = tokenizer.batch_decode(predictions, skip_special_tokens=True)
            ref_str = tokenizer.batch_decode(tgt["input_ids"], skip_special_tokens=True)
            metric.add_batch(
                predictions=accelerator.gather(pred_str), references=accelerator.gather([[r] for r in ref_str]),
            )
        eval_metric = metric.compute()
...

I am getting the following error during training

  File "trainer.py", line 104, in training_function
    outputs = model(input_ids=src["input_ids"], decoder_input_ids=tgt["input_ids"], labels=tgt["input_ids"])
  File "/home/wjv316/anaconda3/envs/indic/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wjv316/anaconda3/envs/indic/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 606, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

and the following during generation

  File "trainer.py", line 120, in training_function
    predictions = model.generate(
  File "/home/wjv316/anaconda3/envs/indic/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'generate'

Both are working fine if I change the configuration to use only one GPU using accelerate config
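
For the second error, a sketch of the usual workaround (using the same variables as the snippet above): DistributedDataParallel only exposes forward, so call generate on the unwrapped model instead.

unwrapped_model = accelerator.unwrap_model(model)
with torch.no_grad():
    predictions = unwrapped_model.generate(
        src["input_ids"],
        decoder_start_token_id=tokenizer.convert_tokens_to_ids("[CLS]"),
        num_beams=4,
    )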

accelerator.gather() at training time

Can I use accelerator.gather() at training time? Would gradients be calculated properly? Basically my use case is something like below toy snippet. It seems that there is some issue with gradient flow in this scheme as my validation accuracy drops to 0.

model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
for i, data in enumerate(train_loader):
    model.zero_grad()
    
    a, b = model(data)
    b_all = accelerator.gather(b)
    c = f(a, b_all)
    loss = criterion(a, b, c)
    accelerator.backward(loss)
    optimizer.step()

Bug: Invalid arguments

Following my PR to the transformers repo, I tested multi-GPU and multi-machine settings and got these two errors:

accelerate <command> [<args>] launch: error: argument --main_process_ip: invalid typing.Union[str, NoneType] value: {some_ip_address}

and

accelerate <command> [<args>] launch: error: argument --main_process_port: invalid typing.Union[int, NoneType] value: '{some_port_value}'

It seems to me this error can be fixed by changing type=Optional[str] -> type=str for the --main_process_ip arg and type=Optional[int] -> type=int for the --main_process_port arg.

@sgugger

TPU num_processes indentation error

In file https://github.com/huggingface/accelerate/blob/main/src/accelerate/commands/config/cluster.py there is an indentation error, leading to num_processes being undefined when using TPU.

if distributed_type == DistributedType.TPU:
        main_training_function = _ask_field(
            "What is the name of the function in your script that should be launched in all parallel scripts? [main]: ",
            default="main",
        )
    else:
        main_training_function = "main"

        num_processes = _ask_field(
            "How many processes in total will you use? [1]: ",
            lambda x: int(x),
            default=1,
            error_message="Please enter an integer.",
        )

        if distributed_type != DistributedType.TPU:
            fp16 = _ask_field(
                "Do you wish to use FP16 (mixed precision)? [yes/NO]: ",
                _convert_yes_no_to_bool,
                default=False,
                error_message="Please enter yes or no.",
            )
        else:
            fp16 = False
