Comments (14)
I've tried overriding `scale_up` and `scale_down`, but on the latter the argument received is relative to one worker, which may be in a pod that has workers that are not meant to be killed.
from dask-kubernetes.
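To illustrate the problem being described (a rough sketch with hypothetical data structures, not dask-kubernetes's actual internals): if `scale_down` receives individual worker addresses while a pod runs several workers, retiring one address can take its pod, and its sibling workers, down with it. A safer policy would only retire pods whose workers are *all* slated for removal:

```python
from collections import defaultdict

def pods_safe_to_kill(pod_of_worker, workers_to_close):
    """Given a mapping of worker address -> pod name and the workers the
    scheduler wants to close, return only the pods whose workers are ALL
    in the to-close set."""
    workers_by_pod = defaultdict(set)
    for worker, pod in pod_of_worker.items():
        workers_by_pod[pod].add(worker)
    to_close = set(workers_to_close)
    return {pod for pod, ws in workers_by_pod.items() if ws <= to_close}

pod_of_worker = {
    "tcp://10.0.0.1:1234": "pod-a",  # pod-a runs two workers (nprocs=2)
    "tcp://10.0.0.1:1235": "pod-a",
    "tcp://10.0.0.2:1234": "pod-b",  # pod-b runs one worker
}
# Scheduler suggests closing one pod-a worker and the pod-b worker:
print(pods_safe_to_kill(pod_of_worker,
                        ["tcp://10.0.0.1:1234", "tcp://10.0.0.2:1234"]))
# -> {'pod-b'}; killing pod-a would also take down a worker meant to live
```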
Thanks for raising this. Do you get the same issues when setting `nthreads` instead of `nprocs`?
@jacobtomlinson No, because `nthreads` does not increase the number of workers co-located inside the same pod. That only happens when you specify `nprocs`, since it actually spins up more than one worker process inside the pod.
@mamoit that's effectively what I thought.
@mrocklin yes I think that might be a good idea.
I cannot see any advantage to running multiple dask workers in the same pod vs running more pods. @mamoit do you have a use case which would benefit from this?
I'm not sure if this is related, but I'm running batch jobs with `--nthreads 1` for the worker, and `cluster.adapt(1, 10)`, and when the cluster is about to scale down, I'm being flooded by these lines every second:
2018-09-17 07:36:28 [INFO] distributed.deploy.adaptive: CPU limit exceeded [988 occupancy / 2 cores]
2018-09-17 07:36:28 [INFO] distributed.scheduler: Suggest closing workers: ['tcp://***:35299']
2018-09-17 07:36:28 [INFO] distributed.deploy.adaptive: Attempting to scale up and scale down simultaneously.
@fbergroth are you setting `nprocs` to something other than 1?
@jacobtomlinson no, I'm not touching `nprocs`.
FWIW, we've got the same kind of problem in dask-jobqueue; the current solution for me is to only specify multiples of `nprocs` when scaling or using adaptive! But we should maybe raise a similar issue...
I agree with @mrocklin and @jacobtomlinson here: in a containerised environment, I don't see any benefit in running multiple similar processes in a container vs running several containers.
In dask-jobqueue, the long-term solution would probably be to round the requested number of workers up to a multiple of `nprocs`.
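That rounding idea can be sketched in a few lines (a hypothetical helper, not dask-jobqueue's actual code): every job starts `nprocs` worker processes, so any requested worker count gets rounded up to a whole number of jobs.

```python
import math

def round_up_to_nprocs(requested_workers: int, nprocs: int) -> int:
    """Round a requested worker count up to a whole number of jobs,
    where each job starts `nprocs` worker processes."""
    jobs = math.ceil(requested_workers / nprocs)
    return jobs * nprocs

# e.g. with nprocs=4, asking for 1 worker still launches one full job
print(round_up_to_nprocs(1, 4))  # 4
print(round_up_to_nprocs(8, 4))  # 8
print(round_up_to_nprocs(9, 4))  # 12
```

With this, `scale()` never asks for a fraction of a job, so scale-down decisions always map onto whole jobs.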
@jacobtomlinson not really, it was just a lazy way of running more workers without changing much of the spec, but since I ran into this, I thought it would be a good thing to share.
I'm now testing a "one CPU per pod" spec without specifying `nprocs`.
Okay, so I spoke too fast previously: the adaptive case is perfectly handled in dask-jobqueue thanks to the use of a `worker_key` function in the adaptive options.
Just tested; it works like a charm. It is, however, not used outside of adaptive: if I manually call `scale(1)` several times, my cluster will stop and start workers.
So this is another option, but I still agree with @mrocklin and @jacobtomlinson.
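The `worker_key` idea can be sketched in plain Python (a mimic of the concept, not distributed's actual implementation): workers are grouped by a key, here the pod name, and whole groups are retired together so a pod is never partially killed.

```python
from collections import defaultdict

def groups_to_close(workers, n_to_close, worker_key):
    """Group workers by worker_key and retire whole groups, never
    splitting one, until at least n_to_close workers are covered."""
    groups = defaultdict(list)
    for w in workers:
        groups[worker_key(w)].append(w)
    chosen, covered = [], 0
    # prefer smaller groups so we overshoot n_to_close as little as possible
    for key, members in sorted(groups.items(), key=lambda kv: len(kv[1])):
        if covered >= n_to_close:
            break
        chosen.extend(members)
        covered += len(members)
    return chosen

workers = ["pod-a/w1", "pod-a/w2", "pod-b/w1", "pod-c/w1"]
key = lambda w: w.split("/")[0]          # group by pod name
print(groups_to_close(workers, 1, key))  # -> ['pod-b/w1'], one whole pod
```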
We moved away from `nprocs`, since it is definitely not the right approach; having one process per pod is the Kubernetes way to do it.
I think there should be a way to either prevent this, or to document it so users don't make the same mistake I did.
@mrocklin @jacobtomlinson and others: one case where I think it could be valuable to run multiple processes inside a single pod is when memory is an issue and you have multiple memory-intensive jobs whose memory usage doesn't perfectly line up temporally. If you have multiple processes running on a single pod that is the size of a physical node, you can set the memory limit of that pod to roughly whatever memory is available on the node, and allow different processes to use different amounts of memory at any given time. If you are running multiple pods (each with one process), you have to set the memory limit of each pod to the max memory usage of its process. In that case, if the memory demands did happen to line up, you wouldn't get a killed worker (which you might want) but would instead get the workers using swap, correct?
Is there any way to set memory limits for each physical node (or for a group of workers), such that any workers on the node would be killed if, cumulatively, their memory usage surpassed a threshold?
Now that the scaling logic has moved upstream to dask/distributed I'm going to close this out.