Comments (4)
I ended up creating a bash script that takes the arguments for torchelastic and runs my pre- and post-training commands. I generalized the script enough to work on multiple cloud providers.
Thanks for the advice.
from elastic.
Could you upload the output directly from the trainers (as the last step in the script)? If you need a single worker to do the upload, you could have rank 0 upload the output.
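A minimal sketch of the rank-0 approach in Python. This assumes the RANK environment variable set by the torchelastic launcher; the actual upload call (e.g. an `aws s3 cp` subprocess) is left as a placeholder:

```python
import os


def is_rank_zero() -> bool:
    """True on the worker that should perform one-off tasks.

    The torchelastic launcher sets RANK on every worker; rank 0 is
    conventionally the one that handles single-writer work like uploads.
    """
    return int(os.environ.get("RANK", "0")) == 0


def maybe_upload(output_dir: str) -> bool:
    """Upload `output_dir` only from rank 0.

    Returns True if this worker performed the (placeholder) upload.
    """
    if not is_rank_zero():
        return False
    # Placeholder: e.g. subprocess.run(["aws", "s3", "cp", "--recursive",
    #                                   output_dir, "s3://my-bucket/out"])
    return True
```

Called as the last step of the training script, every worker runs it but only rank 0 actually uploads.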
We’d also welcome changes to the kubernetes plugin to make init containers work!
I could, but I'd really rather keep the AWS EKS / GCP GKE calls native to their platforms: I already have automation in place for setting output directories, and I'd rather not require my students to special-case "worker-0" in their code when Kubernetes already knows which pod is worker 0.
For other modeling libraries, I usually only have to create a Service in Kubernetes and have all workers point to worker 0. Can something like this be done with the etcd service without ElasticJob? I have tried, but torchelastic couldn't find the host.
Can the entrypoint for ElasticJob be modified so that it takes a script first and then passes the args?
Then I could wrap everything in a bash script, throw it into a ConfigMap, and be done.
I tried running it all through a bash script, but the launcher didn't like it.
I see; yeah, if you're working with multiple cloud providers, what you're trying to do makes sense.
For other modeling libraries, I usually only have to create a service in kubernetes and have all workers point to worker 0; Can something like this be done with the etcd service without elasticjob? I have tried, but torchelastic couldn't find the host.
Can you elaborate on the above (I think this was your initial question about the host not being reachable)? There is nothing specific to routing in torchelastic. If you can ping a host:port from one node, then you should be able to create a socket to the destination host in your training script.
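Since reachability is the thing to verify, a quick sketch of the check the training script effectively performs, in Python; host and port would be whatever rendezvous endpoint you configured:

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Try to open a TCP connection to host:port.

    This is the same operation the training script must be able to
    perform to reach the rendezvous endpoint; if this fails from a
    worker node, the problem is cluster networking / DNS, not
    torchelastic itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this from inside a worker pod against the etcd service name quickly tells you whether Kubernetes DNS and routing are set up as expected.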
FWIW you don't have to use the custom kubernetes controller for launching elastic jobs. You can just launch a bunch of pods and specify the rdzv backend and endpoint.
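A minimal sketch of what such a pod could look like, assuming a hypothetical training image and an etcd Service named etcd-service listening on 2379 (the launcher module path is the one torchelastic ships):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0              # launch one pod like this per worker
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: my-training-image:latest        # hypothetical image name
    command:
    - python
    - -m
    - torchelastic.distributed.launch
    - --nnodes=1:4
    - --rdzv_backend=etcd
    - --rdzv_endpoint=etcd-service:2379    # assumes an etcd Service
    - --rdzv_id=my-job
    - /workspace/train.py                  # hypothetical script path
```

With the rendezvous endpoint pointing at a shared etcd Service, the pods discover each other through rendezvous and no custom controller is required.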
Can the entrypoint for elasticjob be modified such that it takes a script first then passes the args?
Yes, with a caveat: we actually just use the ENTRYPOINT of the Docker image that you specify in the pod spec YAML, so you can build your own image with the ENTRYPOINT of your choice. The controller will, however, set some default args (--rdzv_backend, --rdzv_endpoint, --rdzv_id, --nnodes; see https://github.com/pytorch/elastic/blob/master/kubernetes/controllers/pod.go#L116). As long as your entrypoint script can deal with those and pass them along to torchelastic, you can specify your own entrypoint script.
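Following that approach, the wrapper could live in a ConfigMap and be mounted as the image's ENTRYPOINT; a sketch, where the pre/post scripts and paths are illustrative and "$@" forwards the controller-injected args untouched:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: trainer-entrypoint
data:
  entrypoint.sh: |
    #!/usr/bin/env bash
    set -euo pipefail
    # Pre-training setup (illustrative).
    ./pre_train.sh
    # The controller appends --rdzv_backend, --rdzv_endpoint,
    # --rdzv_id and --nnodes; forward them all to the launcher.
    python -m torchelastic.distributed.launch "$@" /workspace/train.py
    # Post-training upload (illustrative).
    ./post_train.sh
```

The pre/post commands then stay in one place per cloud provider while the controller's args pass straight through to torchelastic.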
Related Issues (20)
- Elastic agent doesn't detect worker failures in NCCL
- Pytorch Lightning with TorchElastic - One worker doesn't start
- Enable NCCL_ASYNC_ERROR_HANDLING in Torchelastic
- Torch Elastic - How to make sure all nodes are in the same AZ?
- Support PyTorch 1.8, TorchVision 0.9.0 and TorchAudio 0.8.0
- ModuleNotFoundError: No module named 'torch.distributed.elastic'
- Out of Date documentation
- Imagenet example fails during accuracy calculation (v0.2.2 on 1.8.1)
- Cannot reuse --rdzv_id between different elastic launches?
- EtcdStore: AttributeError: can't set attribute
- Kubernetes CustomResourceDefinition Moving out of Beta
- submodule path docs/src/pytorch-sphinx-theme not in .gitmodules
- [feature request] petctl to support pulling script directory from github repo by commit or tag
- Is petctl also deprecated?
- [feature request] Add CPU example
- Kubernetes: ttlSecondsAfterFinished not working in ElasticJob spec
- rendezvous: _matches_machine_hostname doesn't resolve hostnames fully
- Please add more torch elastic training examples
- RuntimeError: Expected all tensors to be on the same device, but found at least two devices
- [examples/imagenet/main.py] Why doesn't elastic code contain gpu sync to compute performance, e.g. all_reduce