Comments (5)
Thanks for the questions. The agent makes sure the correct environment variables are set in the worker processes so that they are torch-compliant (e.g. you can trivially initialize process groups).
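As an illustration, a worker script launched this way can read the standard `env://` rendezvous variables directly. This is a minimal sketch; the exact variable set is assumed from `torch.distributed`'s `env://` convention (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, `MASTER_PORT`), and the values below are made up for demonstration:

```python
import os

# Simulate what the agent would export into each worker process
# (these specific values are illustrative only).
os.environ.update({"RANK": "3", "WORLD_SIZE": "8", "LOCAL_RANK": "1",
                   "MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500"})

def read_dist_env():
    """Collect the env:// rendezvous variables the launcher is expected to set."""
    return {
        "rank": int(os.environ["RANK"]),
        "world_size": int(os.environ["WORLD_SIZE"]),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "master_addr": os.environ["MASTER_ADDR"],
        "master_port": int(os.environ["MASTER_PORT"]),
    }

print(read_dist_env())
```

Because the agent owns these variables, the script itself never needs to compute ranks or addresses.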
By default, the launcher uses the EtcdRendezvousHandler.
In any topology, it is completely up to you how you want to create process groups; torchelastic won't create process groups on your behalf. Currently in torch you first initialize a default process group that all workers belong to, and then you can create new sub-groups using `dist.new_group(ranks)` (see https://pytorch.org/docs/stable/distributed.html#groups). Workers communicate with each other using the backend you specify when creating the process group. There are currently two backends: gloo and nccl.
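A minimal single-process sketch of that pattern, assuming PyTorch with the gloo backend is available. In a real elastic job the `env://` variables come from the agent; here we fake a one-worker "cluster" by setting them by hand:

```python
import os
import torch.distributed as dist

# In a real job the agent exports these; fake a 1-worker cluster for the sketch.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# Default group that every worker joins first.
dist.init_process_group(backend="gloo", init_method="env://")

# Sub-group containing a subset of ranks (here just rank 0).
subgroup = dist.new_group(ranks=[0])

world = dist.get_world_size()
sub_world = dist.get_world_size(group=subgroup)
print(world, sub_world)

dist.destroy_process_group()
```

With more workers, `ranks` would list the subset of global ranks that should communicate within the sub-group.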
Note that rendezvous (etcd) is NOT used to do any type of collective communication.
Hope this helps
from elastic.
Apologies for the late reply. Please find my comments below.
> why existing rendezvous was not used. Is it because the existing rendezvous deals with static world_size

The architecture of how/when torchelastic performs rendezvous is different from how pytorch does it. Torchelastic performs rdzv per node, whereas pytorch does it per worker.

> EtcdRendezvous handler, that handler is registered using the Pytorch rendezvous. So it is not independent of Pytorch Rendezvous

Not exactly. The short answer is that rendezvous in torchelastic is an implementation detail that the user need not worry about. We also define our own rdzv handler interface (https://github.com/pytorch/elastic/blob/master/torchelastic/rendezvous/api.py#L42). It just happens that `torchelastic.rendezvous.EtcdRendezvous` is compatible with pytorch because both use functors to register handlers.
> I am not quite sure when I submit the job how many workers will be used in each epoch.

An example will help illustrate this. Let's assume you specified min = 1, max = 4, and `last_call_timeout` = 30 sec (the default). Your training starts as soon as it has 1 node (note "node", not "worker"). If the scheduler was able to give your job 2 nodes to start with, it starts with 2 (all the way up to 4). `last_call_timeout` is the number of seconds to wait before closing the current "round", so if you started with 1 node and another node joined within 15 seconds, your job starts with 2 nodes.

During the course of your training, nodes can leave and join, and whenever these "scaling" events happen, torchelastic "resets" the trainers and starts a fresh "round" (again waiting 30 seconds for any nodes to join).
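For reference, the min/max in the example above map to the launcher's `--nnodes=MIN:MAX` flag. A hedged sketch of such an invocation (the script name, rendezvous id, and etcd endpoint are placeholders):

```shell
python -m torchelastic.distributed.launch \
    --nnodes=1:4 \
    --nproc_per_node=2 \
    --rdzv_backend=etcd \
    --rdzv_endpoint=etcd-host:2379 \
    --rdzv_id=my_job_id \
    my_train_script.py
```

With `--nnodes=1:4`, the job starts once 1 node is available and elastically accommodates up to 4.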
Hope this helps clarify a few things.
@kiukchung Thanks for the response.
I have a few questions based on this information:
- When someone passes `init_method=env://` when initializing process groups (assume this is the only call to start up the training processes), as long as we use the PTE launcher, would this config be ignored in the torch.dist rendezvous? (Referring to this method: https://github.com/pytorch/pytorch/blob/3d8de74e176b2ab39cbf0adc2bdbe229d14bccfa/torch/distributed/rendezvous.py#L133.) Please correct me if I am wrong; I am still catching up with the code.
- In my script I only initialize the process group once. At launch time, I execute `python -m torchelastic.distributed.launch` with the expected params to start the program. Let's say I have 4 nodes and 2 processes per node, and I execute this script on all 4 nodes. Since I specified 4 nodes and 2 processes per node, there will be a total of 8 processes belonging to a single process group. Please correct me if I am wrong.
- This process group manages the rendezvous via the provided rendezvous handler, in this case the PyTorch Elastic EtcdRendezvous handler, so this handler makes sure all the workers talk to each other. Additionally, there are 4 elastic agents, one per node, and each agent is responsible for the local ranks on its node. In this case, there are four worker groups, each with 2 processes. Is this a valid statement for the given example?
- In addition, if one needs a different key-value store (to support another existing application which uses a specific key-value store), must a new rendezvous be created? Meaning, instead of the logic in etcd_rendezvous.py, one would write a custom_rendezvous.py and replace all etcd calls with the other data store's API. But if one needs to extend the current code base this way, there could be code duplication, as etcd calls are tightly coupled in the code. Please correct me if I am wrong.
- Does PyTorch Elastic support model parallelism (scaling and fault tolerance)?
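The custom key-value-store question can be made concrete with a toy sketch. The class below only mirrors the general shape of the handler interface linked earlier (api.py); the method names `next_rendezvous`/`is_closed`/`set_closed` are assumptions based on that file, and the dict-backed store is purely illustrative, not a real c10d Store:

```python
from typing import Dict, Tuple

class InMemoryStore:
    """Toy stand-in for a key-value store backend (illustrative only)."""
    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]

class CustomRendezvousHandler:
    """Sketch of a rendezvous handler backed by a store other than etcd."""
    def __init__(self, store: InMemoryStore, rank: int, world_size: int) -> None:
        self._store = store
        self._rank = rank
        self._world_size = world_size
        self._closed = False

    def next_rendezvous(self) -> Tuple[InMemoryStore, int, int]:
        # A real implementation would block here until at least `min` nodes
        # have gathered, then assign ranks; we return fixed values.
        return self._store, self._rank, self._world_size

    def is_closed(self) -> bool:
        return self._closed

    def set_closed(self) -> None:
        self._closed = True

handler = CustomRendezvousHandler(InMemoryStore(), rank=0, world_size=2)
store, rank, world_size = handler.next_rendezvous()
print(rank, world_size)
```

The point is that the store-specific calls live behind the handler boundary, so swapping etcd for another backend means reimplementing the handler, not the agent.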
@vibhatha please see my answers below:
- The `init_method=env://` maps to a different rendezvous handler. Typically, without TorchElastic, the user has to set the `WORLD_SIZE` and `RANK` env vars when launching the workers; since TorchElastic does this for you, you can simply set `init_method=env://` and not worry about rendezvous in your script's code or when launching your script.
- If you launched on 4 nodes and set `--nproc_per_node=2`, then you'll end up with a total of 8 worker processes. If in your script you call `torch.distributed.init_process_group(init_method="env://")`, then yes, all 8 workers join the same process group.
- To clarify: pytorch's rendezvous and elastic's rendezvous are actually different interfaces (they do similar things but, strictly speaking, are unrelated; they look similar for historical reasons, since elastic used to use torch's rendezvous). Yes, in your case there are 4 local worker groups, each with 2 workers. Keep in mind that `torchelastic.WorkerGroup` does not map to `torch.distributed.ProcessGroup`.
- If you need a key-value store for your script, you are free to use any key-value store of your choice. If you need TorchElastic's rendezvous to NOT use etcd, then yes, you'll have to implement your own rendezvous handler and programmatically use our agent to hook up your rendezvous handler.
- TorchElastic is agnostic to data vs. model parallelism. All it guarantees is that all workers are started/restarted at the same time. It is entirely up to your script to be data- or model-parallel, given that you understand the restart behavior of TorchElastic.
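The 4-nodes-by-2-procs arithmetic in the second point can be sketched as follows. This is a hedged illustration of how a launcher could derive the global `RANK` each worker sees; the formula assumes a homogeneous number of workers per node, which is how the example is set up:

```python
def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank of a worker, assuming every node runs nproc_per_node workers."""
    return node_rank * nproc_per_node + local_rank

nnodes, nproc_per_node = 4, 2
world_size = nnodes * nproc_per_node  # 8 workers join one default process group

# Enumerate every (node, local) pair to get the full rank assignment.
ranks = [global_rank(n, nproc_per_node, l)
         for n in range(nnodes) for l in range(nproc_per_node)]
print(world_size, ranks)
```

Each agent manages only its node's two local ranks (0 and 1), while the global ranks 0..7 span the single process group.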
Thanks for the response. I have some more questions on the current PyTorch Elastic model.

In point 3 you mention "elastic used to use torch's rendezvous". Why was the existing rendezvous not used? Is it because the existing rendezvous deals with a static world_size? But still, within the EtcdRendezvous handler, that handler is registered using the PyTorch rendezvous, so it is not independent of PyTorch rendezvous? Please correct me if I am wrong.

In addition, I have a few questions related to the use of minimum and maximum workers. With the notion of a minimum number of workers, I am not quite sure, when I submit the job, how many workers will be used in each epoch. If the maximum number of workers is not used, the resources are underutilized. What is the usefulness of the minimum and maximum worker numbers? I am not quite sure how to make use of these parameters.