Comments (58)

bhack avatar bhack commented on September 13, 2024

/cc @windreamer

from ecosystem.

windreamer avatar windreamer commented on September 13, 2024

Hmm, it is a wild idea... maybe the most difficult part is how to map GPUs into different tasks?

from ecosystem.

bhack avatar bhack commented on September 13, 2024

/cc @yuefengz

from ecosystem.

bhack avatar bhack commented on September 13, 2024

Ping

from ecosystem.

windreamer avatar windreamer commented on September 13, 2024

It seems docker is not a must: we can just use the GPU allocator of Mesos to allocate resources. But experiments are still ongoing.

from ecosystem.

yuefengz avatar yuefengz commented on September 13, 2024

Sorry, was just back from vacation last week and didn't see your issue. Will take a look at your issue shortly.

from ecosystem.

bhack avatar bhack commented on September 13, 2024

@yuefengz Thank you. @windreamer has already experimented with making docker an optional dependency in tfmesos.

from ecosystem.

bhack avatar bhack commented on September 13, 2024

I assume that TPU resources are available only in managed Google Cloud and not in DC/OS on Google Compute Engine, right?

from ecosystem.

bhack avatar bhack commented on September 13, 2024

/cc @klueska

from ecosystem.

klueska avatar klueska commented on September 13, 2024

It seems that the question is about GPU support in DC/OS? Is that correct?

GPUs will be supported in the upcoming DC/OS 1.9 release with the limitations outlined in this pull request: dcos/dcos#766

The full documentation is not yet complete (it will be ready by 1.9 EA release coming out in early February though). Here is a preview: https://github.com/dcos/dcos-docs/blob/a20548c343a75258ea70799efb9d98c9c6aeeaf7/1.8/usage/gpu-support.md

from ecosystem.

bhack avatar bhack commented on September 13, 2024

@klueska Yes, because docker is currently mandatory.

from ecosystem.

klueska avatar klueska commented on September 13, 2024

What do you mean by "docker is mandatory"? Can you give a little more detail on the context?

DC/OS actually supports two different container runtimes (The Docker Container Runtime and the Universal Container Runtime), both of which are able to run docker containers. The Docker Container Runtime does not yet support GPUs, but the Universal Container Runtime does.

from ecosystem.

bhack avatar bhack commented on September 13, 2024

Yes, I know, and tfmesos uses both solutions, but see how it is integrated here: https://github.com/tensorflow/ecosystem/blob/master/marathon/README.md

from ecosystem.

klueska avatar klueska commented on September 13, 2024

I see. Looking at https://github.com/tensorflow/ecosystem/blob/master/marathon/template.json.jinja
I see no reason that this can't be modified to run with the Universal Container Runtime instead of the Docker Container Runtime (a.k.a. the docker containerizer).

The only change would be to update the container type to "MESOS" and allocate it some "gpus: xxx" if desired.

In fact, that would be the recommended way of running it going forward.
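
As a rough illustration of the change described above, here is a minimal Python sketch that rewrites a Marathon-style app definition to use the Universal Container Runtime and optionally request GPUs. The helper name `to_ucr_app`, the app id, and the image name are hypothetical, and the sketch assumes the app definition keeps its container type under `container.type` and accepts a top-level `gpus` field; check the Marathon documentation for your version before relying on it.

```python
import json


def to_ucr_app(app, gpus=0):
    """Return a copy of a Marathon-style app definition that targets the UCR.

    `app` is a plain dict; the original is left unmodified.
    """
    updated = json.loads(json.dumps(app))  # cheap deep copy of JSON-like data
    container = updated.setdefault("container", {})
    container["type"] = "MESOS"  # Universal Container Runtime
    if gpus > 0:
        updated["gpus"] = gpus  # assumed top-level field: GPUs per task instance
    return updated


# Hypothetical app definition, loosely modelled on a TensorFlow worker task.
docker_app = {
    "id": "/tensorflow/worker-0",
    "cpus": 4,
    "mem": 8192,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "example/tf-worker:latest"},
    },
}

ucr_app = to_ucr_app(docker_app, gpus=1)
print(json.dumps(ucr_app, indent=2))
```

The same edit could be made directly in the Jinja template; the function form is only meant to make the two changed fields easy to see.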

from ecosystem.

bhack avatar bhack commented on September 13, 2024

Yes, and this is what prompted my original question. I am also curious about how new hardware devices like TPUs or other kinds of accelerators could be exposed in the Universal Container Runtime.

from ecosystem.

klueska avatar klueska commented on September 13, 2024

@windreamer Regarding:
Hmm, it is a wild idea... maybe the most difficult part is how to map GPUs into different tasks?

You could consider launching all of the tasks as part of a task-group (aka Pod), in which case they would all share access to the same set of GPUs instead of having to allocate a different GPU to each of them.

from ecosystem.

klueska avatar klueska commented on September 13, 2024

@bhack Regarding:

Yes, and this is what prompted my original question. I am also curious about how new hardware devices like TPUs or other kinds of accelerators could be exposed in the Universal Container Runtime.

That sounds like a great question for the mesos development list: http://mesos.apache.org/community/

There is nothing fundamental about how we do GPU allocation that couldn't be extended to TPUs.

from ecosystem.

bhack avatar bhack commented on September 13, 2024

@klueska Yes, but I don't have a TPU :) Only Google has this ASIC, and I think it will only be available in its managed cloud, so we won't see it as a Mesos resource even on Google Cloud.

from ecosystem.

yuefengz avatar yuefengz commented on September 13, 2024

Thanks @klueska ! I will try it when DC/OS 1.9 is available: The only change would be to update the container type to "MESOS" and allocate it some "gpus: xxx" if desired.

from ecosystem.

yuefengz avatar yuefengz commented on September 13, 2024

@bhack We will give more details about TPU but for now TPU is beyond the scope. We don't have to worry about supporting TPU, especially in Mesos, in the near future.

from ecosystem.

windreamer avatar windreamer commented on September 13, 2024

@bhack I am quite positive docker is not necessary for a tfmesos cluster. Mesos uses the cgroup device whitelist to isolate GPUs between tasks, so when a GPU is allocated, only the task it is assigned to can use that device.

My earlier belief that docker was a must was based on a guess: TensorFlow uses GPUs starting with id 0, and with nvidia-docker the GPUs are automatically renumbered inside the container. But based on my experiments, it seems TensorFlow also calls EnumerateDevices to discover all available GPUs, so renumbering GPU devices is not necessary.
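
One way to check this kind of isolation from inside a task, sketched here under assumptions: the helper below (hypothetical name `accessible_gpu_devices`) tries to open each per-GPU device node and keeps only those the process is allowed to use. It assumes the usual `/dev/nvidiaN` naming of the NVIDIA driver and that a device the cgroup rules deny will fail at `open`; it says nothing about how TensorFlow itself enumerates devices.

```python
import glob
import os
import re


def accessible_gpu_devices(dev_dir="/dev"):
    """Return the NVIDIA device nodes under dev_dir that this process can open."""
    usable = []
    for path in sorted(glob.glob(os.path.join(dev_dir, "nvidia*"))):
        # Keep only per-GPU nodes such as nvidia0; skip nvidiactl, nvidia-uvm.
        if not re.fullmatch(r"nvidia\d+", os.path.basename(path)):
            continue
        try:
            fd = os.open(path, os.O_RDWR)
        except OSError:
            # Denied (for example by a cgroup device rule) or otherwise unusable.
            continue
        os.close(fd)
        usable.append(path)
    return usable


if __name__ == "__main__":
    print(accessible_gpu_devices())
```

Running this in two tasks on a multi-GPU node should, if the isolation works as described, print a different device list in each task.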

from ecosystem.

windreamer avatar windreamer commented on September 13, 2024

But please note that most of our nodes have only one GPU, so my experiment may not have been done correctly.

from ecosystem.

bhack avatar bhack commented on September 13, 2024

@yuefengz Thank you. This means that TPU will be available only on managed Google Cloud for now, right?

from ecosystem.

jhseu avatar jhseu commented on September 13, 2024

Note that we can't comment on any TPU plans :)

from ecosystem.

bhack avatar bhack commented on September 13, 2024

@jhseu OK, we will wait until you are ready to release more info. In the meantime, will XLA have any impact on Mesos?

from ecosystem.

jhseu avatar jhseu commented on September 13, 2024

It's unlikely XLA will directly affect Mesos. The primary benefits of XLA are:

  • Improved performance (from fusing ops, faster op kernels)
  • Easier support for making new hardware work in TensorFlow

from ecosystem.

bhack avatar bhack commented on September 13, 2024

Hmm, on task migration XLA would probably need to emit new native code or choose another op-fusion strategy. What do you think?

from ecosystem.

jhseu avatar jhseu commented on September 13, 2024

XLA shouldn't affect task migration because each worker will JIT compile native code when ops are executed for the first time.
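
A toy model of that behaviour, not XLA itself: each worker keeps its own cache and "compiles" an op the first time it runs it, so a task that lands on a new worker simply triggers compilation there. The `Worker` class and its `_compile` stand-in are hypothetical and exist only for illustration.

```python
class Worker:
    """Toy worker that 'compiles' an op the first time it runs it."""

    def __init__(self, name):
        self.name = name
        self._compiled = {}  # op name -> compiled callable, local to this worker
        self.compile_count = 0

    def _compile(self, op_name, fn):
        # Stand-in for JIT compilation to native code on this worker's hardware.
        self.compile_count += 1
        return fn

    def run(self, op_name, fn, *args):
        if op_name not in self._compiled:
            self._compiled[op_name] = self._compile(op_name, fn)
        return self._compiled[op_name](*args)


def add(a, b):
    return a + b


w1 = Worker("worker-1")
w1.run("add", add, 1, 2)
w1.run("add", add, 3, 4)  # reuses worker-1's compiled op

w2 = Worker("worker-2")  # the task "migrates" to a fresh worker
result = w2.run("add", add, 5, 6)  # compiled again, locally, on first use
```

Nothing compiled on `worker-1` has to move with the task, which is why migration is unaffected in this model.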

from ecosystem.

bhack avatar bhack commented on September 13, 2024

Ok so JIT will be executed on each worker as required.

from ecosystem.

bhack avatar bhack commented on September 13, 2024

@klueska 1.9 EA has been released, right? Could we update the container type in this repository's example and the related docs?

from ecosystem.

klueska avatar klueska commented on September 13, 2024

@bhack Yes, DC/OS 1.9 EA was released last night. Give it a shot and let me know if there are any problems.

from ecosystem.

bhack avatar bhack commented on September 13, 2024

@jhseu @yuefengz?

from ecosystem.

bhack avatar bhack commented on September 13, 2024

The patch is really trivial, but I want to ask the maintainers whether they have a cluster to test it on.

from ecosystem.

jhseu avatar jhseu commented on September 13, 2024

Yeah, we can set up a cluster to test it, so feel free to send a pull request.

from ecosystem.

klueska avatar klueska commented on September 13, 2024

from ecosystem.

bhack avatar bhack commented on September 13, 2024

Ok please comment #38

from ecosystem.

bhack avatar bhack commented on September 13, 2024

@windreamer @klueska Any plan to publish TensorFlow to the DC/OS Universe?

from ecosystem.

klueska avatar klueska commented on September 13, 2024

from ecosystem.

bhack avatar bhack commented on September 13, 2024

@klueska Take a look at TF distributed

from ecosystem.

bhack avatar bhack commented on September 13, 2024

It is also important tensorflow/tensorflow#2126

from ecosystem.

bhack avatar bhack commented on September 13, 2024

#38 is merged. I've opened mesosphere/universe#1026 for the universe topic.

from ecosystem.

bhack avatar bhack commented on September 13, 2024

I want to ask everyone whether it makes sense to integrate more pymesos and tfmesos features here. What do you think?

from ecosystem.

bhack avatar bhack commented on September 13, 2024

Another probable alternative is https://github.com/daskos/mentos/ /cc @daskos

from ecosystem.

bhack avatar bhack commented on September 13, 2024

The DC/OS 1.9 stable release is now generally available.

from ecosystem.

bhack avatar bhack commented on September 13, 2024

I see no activity here. Any further ideas or actions? Do we want to close it?

from ecosystem.

bhack avatar bhack commented on September 13, 2024

@klueska Is there any other feedback on how to improve DC/OS Mesos experience?

from ecosystem.

bhack avatar bhack commented on September 13, 2024

How can this news help improve the Mesos demo? Could another example be created on a "real" dataset like ImageNet?

from ecosystem.

bhack avatar bhack commented on September 13, 2024

This thread is quite dead. Let's see if there is any interest from the @dcos team.

from ecosystem.

bhack avatar bhack commented on September 13, 2024

In the meantime https://dcos.io/blog/2017/tutorial-deep-learning-with-tensorflow-nvidia-and-apache-mesos-dc-os-part-1/index.html /cc @sascala

from ecosystem.

bhack avatar bhack commented on September 13, 2024

@jhseu After Google I/O, I suppose TPU is no longer relevant to Ecosystem because it is fully managed. Is this conclusion correct? If you adopt ecosystem/mesos or DC/OS, you are excluded from the TPU device offering.

from ecosystem.

saeta avatar saeta commented on September 13, 2024

@bhack you should be able to run a TensorFlow job on a DC/OS cluster running on GCE nodes, and connect to a Cloud TPU using distributed TensorFlow.

from ecosystem.

bhack avatar bhack commented on September 13, 2024

Is it really an integration? Can I put TPU in https://github.com/tensorflow/ecosystem/blob/master/marathon/template.json.jinja#L9?

from ecosystem.

jhseu avatar jhseu commented on September 13, 2024

@bhack The initial release will work with vanilla VMs, so a cluster manager may still be useful.

from ecosystem.

bhack avatar bhack commented on September 13, 2024

OK, I will close this in the next few days. It seems to me that there is no input from other stakeholders on improving the Mesos/DC/OS resources in this repository.

from ecosystem.

klueska avatar klueska commented on September 13, 2024

I played around with tfmesos quite a bit, but never quite got it working in a way that I liked on DC/OS.

We are now working on a distributed Tensorflow framework built using the DC/OS SDK https://github.com/mesosphere/dcos-commons

Once it's ready, that will likely be the suggested way to run Tensorflow on DC/OS.

I will update this repo with instructions on how to use it once it's ready.

from ecosystem.

bhack avatar bhack commented on September 13, 2024

OK, I'll leave this open so we can discuss what info we want to put here when the DC/OS work is ready.

from ecosystem.

klueska avatar klueska commented on September 13, 2024

I know this thread is closed, but I wanted to point out the new release of distributed TensorFlow on DC/OS that we announced today. https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/

from ecosystem.

bhack avatar bhack commented on September 13, 2024

@klueska So what do you plan to do about the official k8s effort, given that DC/OS declared full k8s support just a few weeks ago?

from ecosystem.
