Comments (5)
Thanks for the questions. The agent makes sure the correct environment variables are set in the worker processes so that they are torch-compliant (e.g. you can trivially initialize process groups).
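As an illustration, a worker script launched this way can read the standard `env://` rendezvous variables directly. This is a minimal sketch; the exact variable set is assumed from `torch.distributed`'s `env://` convention (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, `MASTER_PORT`), and the values below are made up for demonstration:

```python
import os

# Simulate what the agent would export into each worker process
# (these specific values are illustrative only).
os.environ.update({"RANK": "3", "WORLD_SIZE": "8", "LOCAL_RANK": "1",
                   "MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500"})

def read_dist_env():
    """Collect the env:// rendezvous variables the launcher is expected to set."""
    return {
        "rank": int(os.environ["RANK"]),
        "world_size": int(os.environ["WORLD_SIZE"]),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "master_addr": os.environ["MASTER_ADDR"],
        "master_port": int(os.environ["MASTER_PORT"]),
    }

print(read_dist_env())
```

Because the agent owns these variables, the script itself never needs to compute ranks or addresses.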
By default, the launcher uses the EtcdRendezvousHandler.
In any topology, it is completely up to you how you want to create process groups; torchelastic won't create process groups on your behalf. Currently in torch you first initialize a default process group that all workers belong to, and then you can create new sub-groups using `dist.new_group(ranks)` (see https://pytorch.org/docs/stable/distributed.html#groups). Workers communicate with each other using the backend you specify when creating the process group. There are currently two backends: gloo and nccl.
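A minimal single-process sketch of that pattern, assuming PyTorch with the gloo backend is available. In a real elastic job the `env://` variables come from the agent; here we fake a one-worker "cluster" by setting them by hand:

```python
import os
import torch.distributed as dist

# In a real job the agent exports these; fake a 1-worker cluster for the sketch.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# Default group that every worker joins first.
dist.init_process_group(backend="gloo", init_method="env://")

# Sub-group containing a subset of ranks (here just rank 0).
subgroup = dist.new_group(ranks=[0])

world = dist.get_world_size()
sub_world = dist.get_world_size(group=subgroup)
print(world, sub_world)

dist.destroy_process_group()
```

With more workers, `ranks` would list the subset of global ranks that should communicate within the sub-group.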
Note that rendezvous (etcd) is NOT used to do any type of collective communication.
Hope this helps
from elastic.
Apologies for the late reply. Please find my comments below.
> why existing rendezvous was not used. Is it because the existing rendezvous deals with static world_size

The architecture of how/when torchelastic performs rendezvous is different from how pytorch does it. Torchelastic performs rdzv per node, whereas pytorch does it per worker.

> EtcdRendezvous handler, that handler is registered using the Pytorch rendezvous. So it is not independent of Pytorch Rendezvous

Not exactly. The short answer is that rendezvous in torchelastic is an implementation detail that the user need not worry about. We also define our own rdzv handler interface (https://github.com/pytorch/elastic/blob/master/torchelastic/rendezvous/api.py#L42). It just happens that `torchelastic.rendezvous.EtcdRendezvous` is compatible with pytorch because both use functors to register handlers.
> I am not quite sure when I submit the job how many workers will be used in each epoch.

An example will help illustrate this. Let's assume you specified min = 1, max = 4, and `last_call_timeout` = 30 sec (the default). Your training starts as soon as it has 1 node (note "node", not "worker"). If the scheduler was able to give your job 2 nodes to start with, it starts with 2 (all the way up to 4). `last_call_timeout` is the number of seconds to wait before closing the current "round", so if you started with 1 node and another node joined within 15 seconds, your job starts with 2 nodes.

During the course of your training, nodes can leave and join, and whenever these "scaling" events happen, torchelastic "resets" the trainers and starts a fresh "round" (again waiting 30 seconds for any nodes to join).
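For reference, the min/max in the example above map to the launcher's `--nnodes=MIN:MAX` flag. A hedged sketch of such an invocation (the script name, rendezvous id, and etcd endpoint are placeholders):

```shell
python -m torchelastic.distributed.launch \
    --nnodes=1:4 \
    --nproc_per_node=2 \
    --rdzv_backend=etcd \
    --rdzv_endpoint=etcd-host:2379 \
    --rdzv_id=my_job_id \
    my_train_script.py
```

With `--nnodes=1:4`, the job starts once 1 node is available and elastically accommodates up to 4.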
Hope this helps clarify a few things.
@kiukchung Thanks for the response.
I have a few questions based on this information:
- When someone passes `init_method=env://` when initializing process groups (assume this is the only call to start up the training processes), as long as we use the PTE launcher, would this config be ignored in the torch.dist rendezvous? (Referring to this method: https://github.com/pytorch/pytorch/blob/3d8de74e176b2ab39cbf0adc2bdbe229d14bccfa/torch/distributed/rendezvous.py#L133.) Please correct me if I am wrong; I am still catching up with the code.
- In my script I only initialize the process group once. At launch time, I execute `python -m torchelastic.distributed.launch` with the expected params to start the program. Let's say I have 4 nodes and 2 processes per node, and I execute this script on all 4 nodes. Since I specified 4 nodes and 2 processes per node, there will be a total of 8 processes belonging to a single process group. Please correct me if I am wrong.
- This process group manages the rendezvous via the provided rendezvous handler, in this case the PyTorch Elastic EtcdRendezvous handler, so this handler makes sure all the workers talk to each other. Additionally, there are 4 elastic agents, one per node, and each agent is responsible for the local ranks on its node. In this case, there are four worker groups, each with 2 processes. Is this a valid statement for the given example?
- In addition, if one needs a different key-value store (to support another existing application which uses a specific key-value store), must a new rendezvous be created? Meaning, instead of the logic in etcd_rendezvous.py, one would write a custom_rendezvous.py and replace all etcd calls with the other data store's API. But if one needs to extend the current code base this way, there could be code duplication, as etcd calls are tightly coupled in the code. Please correct me if I am wrong.
- Does PyTorch Elastic support model parallelism (scaling and fault tolerance)?
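The custom key-value-store question can be made concrete with a toy sketch. The class below only mirrors the general shape of the handler interface linked earlier (api.py); the method names `next_rendezvous`/`is_closed`/`set_closed` are assumptions based on that file, and the dict-backed store is purely illustrative, not a real c10d Store:

```python
from typing import Dict, Tuple

class InMemoryStore:
    """Toy stand-in for a key-value store backend (illustrative only)."""
    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]

class CustomRendezvousHandler:
    """Sketch of a rendezvous handler backed by a store other than etcd."""
    def __init__(self, store: InMemoryStore, rank: int, world_size: int) -> None:
        self._store = store
        self._rank = rank
        self._world_size = world_size
        self._closed = False

    def next_rendezvous(self) -> Tuple[InMemoryStore, int, int]:
        # A real implementation would block here until at least `min` nodes
        # have gathered, then assign ranks; we return fixed values.
        return self._store, self._rank, self._world_size

    def is_closed(self) -> bool:
        return self._closed

    def set_closed(self) -> None:
        self._closed = True

handler = CustomRendezvousHandler(InMemoryStore(), rank=0, world_size=2)
store, rank, world_size = handler.next_rendezvous()
print(rank, world_size)
```

The point is that the store-specific calls live behind the handler boundary, so swapping etcd for another backend means reimplementing the handler, not the agent.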
@vibhatha please see my answers below:
- The `init_method=env://` maps to a different rendezvous handler. Typically, without TorchElastic, the user has to set the `WORLD_SIZE` and `RANK` env vars when launching the workers; since TorchElastic does this for you, you can simply set `init_method=env://` and not worry about rendezvous in your script's code or when launching your script.
- If you launched on 4 nodes and set `--nproc_per_node=2`, then you'll end up with a total of 8 worker processes. If in your script you call `torch.distributed.init_process_group(init_method="env://")`, then yes, all 8 workers join the same process group.
- To clarify: pytorch's rendezvous and elastic's rendezvous are actually different interfaces (they do similar things but, strictly speaking, are unrelated; they look similar for historical reasons, since elastic used to use torch's rendezvous). Yes, in your case there are 4 local worker groups, each with 2 workers. Keep in mind that `torchelastic.WorkerGroup` does not map to `torch.distributed.ProcessGroup`.
- If you need a key-value store for your script, you are free to use any key-value store of your choice. If you need TorchElastic's rendezvous to NOT use etcd, then yes, you'll have to implement your own rendezvous handler and programmatically use our agent to hook up your rendezvous handler.
- TorchElastic is agnostic to data vs. model parallelism. All it guarantees is that all workers are started/restarted at the same time. It is entirely up to your script to be data- or model-parallel, given that you understand the restart behavior of TorchElastic.
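The 4-nodes-by-2-procs arithmetic in the second point can be sketched as follows. This is a hedged illustration of how a launcher could derive the global `RANK` each worker sees; the formula assumes a homogeneous number of workers per node, which is how the example is set up:

```python
def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank of a worker, assuming every node runs nproc_per_node workers."""
    return node_rank * nproc_per_node + local_rank

nnodes, nproc_per_node = 4, 2
world_size = nnodes * nproc_per_node  # 8 workers join one default process group

# Enumerate every (node, local) pair to get the full rank assignment.
ranks = [global_rank(n, nproc_per_node, l)
         for n in range(nnodes) for l in range(nproc_per_node)]
print(world_size, ranks)
```

Each agent manages only its node's two local ranks (0 and 1), while the global ranks 0..7 span the single process group.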
Thanks for the response. I have some more questions on the current PyTorch Elastic model.

In point 3 you mention "elastic used to use torch's rendezvous". Why was the existing rendezvous not used? Is it because the existing rendezvous deals with a static world_size? But still, within the EtcdRendezvous handler, that handler is registered using the PyTorch rendezvous, so it is not independent of PyTorch rendezvous? Please correct me if I am wrong.

In addition, I have a few questions related to the use of minimum and maximum workers. With the notion of a minimum number of workers, I am not quite sure, when I submit the job, how many workers will be used in each epoch. If the maximum number of workers is not used, the resources are underutilized. What is the usefulness of the minimum and maximum worker numbers? I am not quite sure how to make use of these parameters.