
Comments (4)

mckunkel commented on August 18, 2024

I ended up creating a bash script that takes the arguments for torchelastic and runs my pre- and post-training commands inside the script. I generalized the script enough to work on multiple cloud providers.
Thanks for the advice.

from elastic.

kiukchung commented on August 18, 2024

Could you upload the output directly from the trainers (as the last step in the script)? If you need one worker to upload then you could have rank 0 upload the output.
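A minimal sketch of the rank-0 idea, assuming the launcher exports a `RANK` environment variable to each worker. `upload_output` and its argument are hypothetical placeholders for whatever provider-specific upload you already have:

```python
import os


def is_rank_zero(env=None):
    """Return True when this process is the global rank-0 worker.

    Assumes the launcher sets RANK; falls back to 0 for single-process runs.
    """
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


def upload_output(output_dir):
    """Hypothetical placeholder: swap in your provider-specific upload."""
    print(f"uploading {output_dir}")
    return True


def finish_training(output_dir, env=None):
    """Last step of the training script: only rank 0 uploads the output."""
    if is_rank_zero(env):
        return upload_output(output_dir)
    return False
```

Calling `finish_training(...)` at the very end of the training script keeps the upload to a single worker; every other rank returns without doing anything.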

We’d also welcome changes to the kubernetes plugin to make init containers work!


mckunkel commented on August 18, 2024

I could, but I would really rather keep the AWS:EKS/GCP:GKE calls native to their platform, because I already have automation established for setting output directories, and I would rather not require my students to code for the "worker-0" case when kubernetes already knows who worker 0 is.

For other modeling libraries, I usually only have to create a service in kubernetes and have all workers point to worker 0.
Can something like this be done with the etcd service without elasticjob? I have tried, but torchelastic couldn't find the host.

Can the entrypoint for elasticjob be modified such that it takes a script first then passes the args?
Then I can wrap everything into a bash script, throw it into a config map and be done.

I tried running it all through a bash script, but the launch didn't like it.


kiukchung commented on August 18, 2024

I see. Yeah, if you are working with multiple cloud providers, what you are trying to do makes sense.

> For other modeling libraries, I usually only have to create a service in kubernetes and have all workers point to worker 0. Can something like this be done with the etcd service without elasticjob? I have tried, but torchelastic couldn't find the host.

Can you elaborate on the above (I think this was your initial question about the host not being reachable)? There is nothing specific to routing in torchelastic. If you can ping a host:port from one node, then you should be able to create a socket to the destination host in your training script.
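As a quick way to test that from inside a pod, here is a minimal sketch using only the Python standard library. The function name is illustrative, and the host and port you pass in are whatever your own service exposes:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `can_connect("etcd-service", 2379)` (a hypothetical service name, with etcd's default client port) would show whether the rendezvous endpoint is reachable from that node before torchelastic is involved at all.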

FWIW you don't have to use the custom kubernetes controller for launching elastic jobs. You can just launch a bunch of pods and specify the rdzv backend and endpoint.

> Can the entrypoint for elasticjob be modified such that it takes a script first then passes the args?

Yes, with a caveat. We actually just use the ENTRYPOINT of the docker image that you specify in the pod spec yaml, so you can build your own image with the ENTRYPOINT of your choice. The controller will, however, set some default args (--rdzv_backend, --rdzv_endpoint, --rdzv_id, --nnodes - see https://github.com/pytorch/elastic/blob/master/kubernetes/controllers/pod.go#L116). As long as your entrypoint script is able to deal with those and pass them along to torchelastic, you can specify your own entrypoint script.
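A minimal sketch of such an entrypoint wrapper, written in Python rather than bash for illustration. It runs a pre-training command, forwards every argument it received (including the controller's defaults) to the launcher unchanged, then runs a post-training command. `LAUNCHER`, `PRE_CMD`, and `POST_CMD` are assumptions for this example, not something the controller defines; substitute whatever your image actually uses:

```python
import subprocess
import sys

# Assumed values for illustration only; adjust to your own image.
LAUNCHER = [sys.executable, "-m", "torchelastic.distributed.launch"]
PRE_CMD = ["echo", "pre-training step"]
POST_CMD = ["echo", "post-training step"]


def build_launch_command(passthrough_args, launcher=LAUNCHER):
    """Prepend the launcher to the args the container was started with."""
    return list(launcher) + list(passthrough_args)


def main(argv, run=subprocess.run):
    """Run the pre step, the launcher with forwarded args, then the post step."""
    run(PRE_CMD, check=True)
    result = run(build_launch_command(argv))
    # The post step runs even if training failed; training's exit code is returned.
    run(POST_CMD, check=True)
    return result.returncode
```

Ending the script with `sys.exit(main(sys.argv[1:]))` and pointing the image's ENTRYPOINT at it means the wrapper never needs to parse the `--rdzv_*` or `--nnodes` values itself; it only passes them through.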

