Comments (14)
I've tried overriding `scale_up` and `scale_down`, but on the latter the argument received is relative to one worker, which may be in a pod that has workers that are not meant to be killed.
from dask-kubernetes.
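To illustrate the problem being described (a rough sketch with hypothetical data structures, not dask-kubernetes's actual internals): if `scale_down` receives individual worker addresses while a pod runs several workers, retiring one address can take its pod, and its sibling workers, down with it. A safer policy would only retire pods whose workers are *all* slated for removal:

```python
from collections import defaultdict

def pods_safe_to_kill(pod_of_worker, workers_to_close):
    """Given a mapping of worker address -> pod name and the workers the
    scheduler wants to close, return only the pods whose workers are ALL
    in the to-close set."""
    workers_by_pod = defaultdict(set)
    for worker, pod in pod_of_worker.items():
        workers_by_pod[pod].add(worker)
    to_close = set(workers_to_close)
    return {pod for pod, ws in workers_by_pod.items() if ws <= to_close}

pod_of_worker = {
    "tcp://10.0.0.1:1234": "pod-a",  # pod-a runs two workers (nprocs=2)
    "tcp://10.0.0.1:1235": "pod-a",
    "tcp://10.0.0.2:1234": "pod-b",  # pod-b runs one worker
}
# Scheduler suggests closing one pod-a worker and the pod-b worker:
print(pods_safe_to_kill(pod_of_worker,
                        ["tcp://10.0.0.1:1234", "tcp://10.0.0.2:1234"]))
# -> {'pod-b'}; killing pod-a would also take down a worker meant to live
```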
Thanks for raising this. Do you get the same issues when setting `nthreads` instead of `nprocs`?
@jacobtomlinson No, because `nthreads` does not increase the number of workers co-located inside the same pod. That only happens when you specify `nprocs`, since it actually spins up more than one worker process inside the pod.
@mamoit that's effectively what I thought.
@mrocklin yes I think that might be a good idea.
I cannot see any advantage to running multiple dask workers in the same pod vs running more pods. @mamoit do you have a use case which would benefit from this?
I'm not sure if this is related, but I'm running batch jobs with `--nthreads 1` for the worker, and `cluster.adapt(1, 10)`, and when the cluster is about to scale down, I'm being flooded by these lines every second:
2018-09-17 07:36:28 [INFO] distributed.deploy.adaptive: CPU limit exceeded [988 occupancy / 2 cores]
2018-09-17 07:36:28 [INFO] distributed.scheduler: Suggest closing workers: ['tcp://***:35299']
2018-09-17 07:36:28 [INFO] distributed.deploy.adaptive: Attempting to scale up and scale down simultaneously.
@fbergroth are you setting `nprocs` to something other than 1?
@jacobtomlinson no, I'm not touching `nprocs`.
FWIW, we've got the same kind of problem in dask-jobqueue; the current solution for me is to only specify multiples of `nprocs` when scaling or using adaptive! But we should maybe raise a similar issue...
I agree with @mrocklin and @jacobtomlinson here: in a containerised environment, I don't see any benefit in running multiple similar processes in a container vs running several containers.
In dask-jobqueue, the long-term solution would probably be to round the requested number of workers up to a multiple of `nprocs`.
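That rounding idea can be sketched in a few lines (a hypothetical helper, not dask-jobqueue's actual code): every job starts `nprocs` worker processes, so any requested worker count gets rounded up to a whole number of jobs.

```python
import math

def round_up_to_nprocs(requested_workers: int, nprocs: int) -> int:
    """Round a requested worker count up to a whole number of jobs,
    where each job starts `nprocs` worker processes."""
    jobs = math.ceil(requested_workers / nprocs)
    return jobs * nprocs

# e.g. with nprocs=4, asking for 1 worker still launches one full job
print(round_up_to_nprocs(1, 4))  # 4
print(round_up_to_nprocs(8, 4))  # 8
print(round_up_to_nprocs(9, 4))  # 12
```

With this, `scale()` never asks for a fraction of a job, so scale-down decisions always map onto whole jobs.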
@jacobtomlinson not really, it was just a lazy way of running more workers without changing much of the spec, but since I ran into this, I thought it would be a good thing to share.
I'm now testing a "one CPU per pod" spec without specifying `nprocs`.
Okay, so I spoke too fast previously: the adaptive case is perfectly handled in dask-jobqueue thanks to the use of a `worker_key` function in the adaptive options.
Just tested; it works like a charm. It is, however, not used outside of adaptive: if I manually call `scale(1)` several times, my cluster will stop and start workers.
So this is another option, but I still agree with @mrocklin and @jacobtomlinson.
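The `worker_key` idea can be sketched in plain Python (a mimic of the concept, not distributed's actual implementation): workers are grouped by a key, here the pod name, and whole groups are retired together so a pod is never partially killed.

```python
from collections import defaultdict

def groups_to_close(workers, n_to_close, worker_key):
    """Group workers by worker_key and retire whole groups, never
    splitting one, until at least n_to_close workers are covered."""
    groups = defaultdict(list)
    for w in workers:
        groups[worker_key(w)].append(w)
    chosen, covered = [], 0
    # prefer smaller groups so we overshoot n_to_close as little as possible
    for key, members in sorted(groups.items(), key=lambda kv: len(kv[1])):
        if covered >= n_to_close:
            break
        chosen.extend(members)
        covered += len(members)
    return chosen

workers = ["pod-a/w1", "pod-a/w2", "pod-b/w1", "pod-c/w1"]
key = lambda w: w.split("/")[0]          # group by pod name
print(groups_to_close(workers, 1, key))  # -> ['pod-b/w1'], one whole pod
```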
We moved away from `nprocs`, since it is definitely not the right approach; having one process per pod is the Kubernetes way to do it.
I think there should be a way to either prevent this, or to document it so users don't make the same mistake I did.
@mrocklin @jacobtomlinson and others: one case where I think it could be valuable to run multiple processes inside a single pod is when memory is an issue and you have multiple memory-intensive jobs whose memory usage doesn't perfectly line up temporally. If you have multiple processes running on a single pod that is the size of a physical node, you can set the memory limit of that pod to roughly whatever memory is available on the node, and allow different processes to use different amounts of memory at any given time. If you are running multiple pods (each with one process), you have to set the memory limit of each pod to the max memory usage of its process. In that case, if the memory demands did happen to line up, you wouldn't get a killed worker (which you might want) but would instead get the workers using swap, correct?
Is there any way to set memory limits for each physical node (or for a group of workers), such that any workers on the node would be killed if, cumulatively, their memory usage surpassed a threshold?
Now that the scaling logic has moved upstream to dask/distributed I'm going to close this out.