
mamoit commented on May 29, 2024

I've tried overriding scale_up and scale_down, but on the latter the argument received refers to individual workers, and a worker may live in a pod that also hosts workers that are not meant to be killed.
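
A minimal, self-contained sketch of the mismatch (numbers are illustrative, not from the thread): adaptive logic counts individual workers, but Kubernetes can only delete whole pods, so retiring one worker takes its pod-mates with it.

```python
# Illustrative only: adaptive scaling counts workers, Kubernetes deletes pods.
nprocs = 4                # worker processes per pod (from --nprocs 4)
workers_to_retire = 1     # what scale_down is asked to remove
pods_to_delete = -(-workers_to_retire // nprocs)    # ceiling division -> 1 pod
workers_actually_killed = pods_to_delete * nprocs   # 4 workers die, 3 unintended
print(f"{workers_actually_killed - workers_to_retire} workers killed that should have stayed")
```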

jacobtomlinson commented on May 29, 2024

Thanks for raising this.

Do you get the same issues when setting nthreads instead of nprocs?

mrocklin commented on May 29, 2024

mamoit commented on May 29, 2024

@jacobtomlinson No, because nthreads does not increase the number of workers co-located inside the same pod. That only happens when you specify nprocs, since it actually spins up more than one worker process inside the pod.
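
As a hedged illustration of the difference (image name and spec fields are assumptions, not from the thread): --nprocs multiplies the scheduler-visible workers per pod, while --nthreads only multiplies threads inside a single worker.

```python
from dask_kubernetes import KubeCluster

# Four scheduler-visible workers sharing one pod (the problematic setup):
many_workers_per_pod = {
    "kind": "Pod",
    "spec": {"containers": [{
        "name": "dask-worker",
        "image": "daskdev/dask:latest",
        "args": ["dask-worker", "--nprocs", "4", "--nthreads", "1"],
    }]},
}

# One scheduler-visible worker per pod, using four threads instead:
one_worker_per_pod = {
    "kind": "Pod",
    "spec": {"containers": [{
        "name": "dask-worker",
        "image": "daskdev/dask:latest",
        "args": ["dask-worker", "--nthreads", "4"],
    }]},
}

cluster = KubeCluster.from_dict(one_worker_per_pod)  # scaling maps 1 pod to 1 worker
```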

jacobtomlinson commented on May 29, 2024

@mamoit that's effectively what I thought.

@mrocklin yes I think that might be a good idea.

I cannot see any advantage to running multiple dask workers in the same pod vs running more pods. @mamoit do you have a use case which would benefit from this?

fbergroth commented on May 29, 2024

I'm not sure if this is related, but I'm running batch jobs with --nthreads 1 for the workers and cluster.adapt(1, 10). When the cluster is about to scale down, I'm flooded with these lines every second:

```
2018-09-17 07:36:28 [INFO] distributed.deploy.adaptive: CPU limit exceeded [988 occupancy / 2 cores]
2018-09-17 07:36:28 [INFO] distributed.scheduler: Suggest closing workers: ['tcp://***:35299']
2018-09-17 07:36:28 [INFO] distributed.deploy.adaptive: Attempting to scale up and scale down simultaneously.
```

jacobtomlinson commented on May 29, 2024

@fbergroth are you setting nprocs to something other than 1?

fbergroth commented on May 29, 2024

@jacobtomlinson no, I'm not touching nprocs

guillaumeeb commented on May 29, 2024

FWIW, we've got the same kind of problem in dask-jobqueue; my current workaround is to only request multiples of nprocs when scaling or using adaptive! But maybe we should raise a similar issue there...

I agree with @mrocklin and @jacobtomlinson here: in a containerised environment, I don't see any benefit in running multiple similar processes in one container vs running several containers.

In dask-jobqueue, the long-term solution would probably be to round the requested number of workers up to the next multiple of nprocs, as sketched below.
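
A minimal sketch of that rounding (the function name is made up for illustration):

```python
import math

def pods_for(n_workers: int, nprocs: int) -> int:
    # Round the requested worker count up to whole pods/jobs,
    # so we never try to retire a fraction of a pod.
    return math.ceil(n_workers / nprocs)

assert pods_for(5, 4) == 2   # 2 pods x 4 procs = 8 workers >= 5 requested
```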

mamoit commented on May 29, 2024

@jacobtomlinson not really, it was just a lazy way of running more workers without changing much of the spec, but since I ran into this I thought it would be worth sharing.
I'm now testing a "one CPU per pod" spec without specifying nprocs.
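
For reference, a sketch of that kind of spec, assuming dask-kubernetes' make_pod_spec helper (the image name and resource sizes are placeholders):

```python
from dask_kubernetes import KubeCluster, make_pod_spec

# One CPU (and therefore one worker process) per pod; nprocs stays at 1.
pod_spec = make_pod_spec(
    image="daskdev/dask:latest",
    cpu_limit=1, cpu_request=1,
    memory_limit="4G", memory_request="4G",
)
cluster = KubeCluster(pod_spec)
cluster.adapt(minimum=1, maximum=10)  # 1 pod == 1 worker, so counts line up exactly
```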

guillaumeeb commented on May 29, 2024

Okay, so I spoke too fast previously: the adaptive case is handled perfectly in dask-jobqueue thanks to the use of a worker_key function in the adaptive options.

Just tested, works like a charm. It is, however, not used outside of adaptive: if I manually call scale(1) several times, my cluster will stop and start workers.
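
A sketch of that grouping trick, assuming adapt() forwards keyword arguments to distributed's Adaptive (the grouping key is an assumption here; workers in one pod share an IP, and dask-jobqueue groups by job in a similar way):

```python
# Group workers by host so adaptive scaling only retires whole
# co-located groups (i.e. whole pods/jobs), never a lone sibling.
cluster.adapt(
    minimum=1,
    maximum=10,
    worker_key=lambda ws: ws.host,  # ws is a scheduler WorkerState
)
```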

So this is another option, but I still agree with @mrocklin and @jacobtomlinson.

mamoit commented on May 29, 2024

We moved away from nprocs, since it is definitely not the right approach, and having one process per pod is the Kubernetes way to do it.
I think there should be a way to either prevent this, or to document it so users don't make the same mistake I did.

bolliger32 commented on May 29, 2024

@mrocklin @jacobtomlinson and others: one case where I think it could be valuable to run multiple processes inside a single pod is when memory is tight and you have multiple memory-intensive jobs whose memory usage doesn't line up perfectly in time. If you run multiple processes in a single pod sized like a physical node, you can set that pod's memory limit to roughly the memory available on the node and let different processes use different amounts of memory at any given time. If you instead run multiple pods (each with one process), you have to set each pod's memory limit to the max memory usage of its process. In that case, if the memory demands did happen to line up, you wouldn't get a killed worker (which you might want) but would instead get the workers using swap, correct?

Is there any way to set memory limits for each physical node (or for a group of workers), such that any workers on the node would be killed if, cumulatively, their memory usage surpassed a threshold?
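
To make the scenario concrete, a hedged sketch (assuming make_pod_spec's memory_limit maps to the container's Kubernetes memory limit; the image name and sizes are placeholders): the pod-level limit is enforced against the cumulative usage of everything in the pod, which is roughly the group-level kill behaviour being asked about.

```python
from dask_kubernetes import make_pod_spec

# One node-sized pod; Kubernetes enforces the limit against the *sum* of
# all worker processes inside it, so one process can spike while another idles.
pod_spec = make_pod_spec(
    image="daskdev/dask:latest",
    memory_limit="60G",   # roughly one physical node
    memory_request="60G",
)
# With --nprocs N inside this pod, the pod is OOM-killed only when the
# workers' cumulative usage exceeds 60G, not when any single one spikes.
```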

jacobtomlinson commented on May 29, 2024

Now that the scaling logic has moved upstream to dask/distributed I'm going to close this out.
