Comments (4)
I ended up creating a bash script that takes the arguments for torchelastic and runs my pre- and post-training commands. I generalized the script enough to work on multiple cloud providers.
Thanks for the advice.
from elastic.
Could you upload the output directly from the trainers (as the last step in the script)? If you need a single worker to do the upload, you could have rank 0 upload the output.
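A minimal sketch of the rank-0 approach in Python. This assumes the RANK environment variable set by the torchelastic launcher; the actual upload call (e.g. an `aws s3 cp` subprocess) is left as a placeholder:

```python
import os


def is_rank_zero() -> bool:
    """True on the worker that should perform one-off tasks.

    The torchelastic launcher sets RANK on every worker; rank 0 is
    conventionally the one that handles single-writer work like uploads.
    """
    return int(os.environ.get("RANK", "0")) == 0


def maybe_upload(output_dir: str) -> bool:
    """Upload `output_dir` only from rank 0.

    Returns True if this worker performed the (placeholder) upload.
    """
    if not is_rank_zero():
        return False
    # Placeholder: e.g. subprocess.run(["aws", "s3", "cp", "--recursive",
    #                                   output_dir, "s3://my-bucket/out"])
    return True
```

Called as the last step of the training script, every worker runs it but only rank 0 actually uploads.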
We’d also welcome changes to the kubernetes plugin to make init containers work!
I could, but I'd really rather keep the AWS EKS / GCP GKE calls native to their platforms: I already have automation in place for setting output directories, and I'd rather not require my students to special-case "worker-0" in their code when Kubernetes already knows which pod is worker 0.
For other modeling libraries, I usually only have to create a Service in Kubernetes and have all workers point to worker 0. Can something like this be done with the etcd service without ElasticJob? I have tried, but torchelastic couldn't find the host.
Can the entrypoint for ElasticJob be modified so that it takes a script first and then passes the args?
Then I could wrap everything in a bash script, throw it into a ConfigMap, and be done.
I tried running it all through a bash script, but the launcher didn't like it.
I see; yeah, if you're working with multiple cloud providers, what you're trying to do makes sense.
For other modeling libraries, I usually only have to create a service in kubernetes and have all workers point to worker 0; Can something like this be done with the etcd service without elasticjob? I have tried, but torchelastic couldn't find the host.
Can you elaborate on the above (I think this was your initial question about the host not being reachable)? There is nothing specific to routing in torchelastic. If you can ping a host:port from one node, then you should be able to create a socket to the destination host in your training script.
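Since reachability is the thing to verify, a quick sketch of the check the training script effectively performs, in Python; host and port would be whatever rendezvous endpoint you configured:

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Try to open a TCP connection to host:port.

    This is the same operation the training script must be able to
    perform to reach the rendezvous endpoint; if this fails from a
    worker node, the problem is cluster networking / DNS, not
    torchelastic itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this from inside a worker pod against the etcd service name quickly tells you whether Kubernetes DNS and routing are set up as expected.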
FWIW you don't have to use the custom kubernetes controller for launching elastic jobs. You can just launch a bunch of pods and specify the rdzv backend and endpoint.
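A minimal sketch of what such a pod could look like, assuming a hypothetical training image and an etcd Service named etcd-service listening on 2379 (the launcher module path is the one torchelastic ships):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0              # launch one pod like this per worker
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: my-training-image:latest        # hypothetical image name
    command:
    - python
    - -m
    - torchelastic.distributed.launch
    - --nnodes=1:4
    - --rdzv_backend=etcd
    - --rdzv_endpoint=etcd-service:2379    # assumes an etcd Service
    - --rdzv_id=my-job
    - /workspace/train.py                  # hypothetical script path
```

With the rendezvous endpoint pointing at a shared etcd Service, the pods discover each other through rendezvous and no custom controller is required.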
Can the entrypoint for elasticjob be modified such that it takes a script first then passes the args?
Yes, with a caveat: we actually just use the ENTRYPOINT of the Docker image that you specify in the pod spec YAML, so you can build your own image with the ENTRYPOINT of your choice. The controller will, however, set some default args (--rdzv_backend, --rdzv_endpoint, --rdzv_id, --nnodes; see https://github.com/pytorch/elastic/blob/master/kubernetes/controllers/pod.go#L116). As long as your entrypoint script can deal with those and pass them along to torchelastic, you can specify your own entrypoint script.
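Following that approach, the wrapper could live in a ConfigMap and be mounted as the image's ENTRYPOINT; a sketch, where the pre/post scripts and paths are illustrative and "$@" forwards the controller-injected args untouched:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: trainer-entrypoint
data:
  entrypoint.sh: |
    #!/usr/bin/env bash
    set -euo pipefail
    # Pre-training setup (illustrative).
    ./pre_train.sh
    # The controller appends --rdzv_backend, --rdzv_endpoint,
    # --rdzv_id and --nnodes; forward them all to the launcher.
    python -m torchelastic.distributed.launch "$@" /workspace/train.py
    # Post-training upload (illustrative).
    ./post_train.sh
```

The pre/post commands then stay in one place per cloud provider while the controller's args pass straight through to torchelastic.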
Related Issues (20)
- Elastic agent doesn't detect worker failures in NCCL
- Pytorch Lightning with TorchElastic - One worker doesn't start
- Enable NCCL_ASYNC_ERROR_HANDLING in Torchelastic
- Torch Elastic - How to make sure all nodes are in the same AZ?
- Support PyTorch 1.8, TorchVision 0.9.0 and TorchAudio 0.8.0
- ModuleNotFoundError: No module named 'torch.distributed.elastic'
- Out of Date documentation
- Imagenet example fails during accuracy calculation (v0.2.2 on 1.8.1)
- Cannot reuse --rdzv_id between different elastic launches?
- EtcdStore: AttributeError: can't set attribute
- Kubernetes CustomResourceDefinition Moving out of Beta
- submodule path docs/src/pytorch-sphinx-theme not in .gitmodules
- [feature request] petctl to support pulling script directory from github repo by commit or tag
- Is petctl also deprecated?
- [feature request] Add CPU example
- Kubernetes: ttlSecondsAfterFinished not working in ElasticJob spec
- rendezvous: _matches_machine_hostname doesn't resolve hostnames fully
- Please add more torch elastic training examples
- RuntimeError: Expected all tensors to be on the same device, but found at least two devices
- [examples/imagenet/main.py] Why doesn't elastic code contain gpu sync to compute performance, e.g. all_reduce