K8sClusterManagers.jl's People

Contributors: christopher-dg, ericphanson, kimlaberinto, omus

K8sClusterManagers.jl's Issues

Revise interface to be in line with Distributed.jl

Currently the K8sClusterManager uses this interface to spawn workers:

pids = K8sClusterManagers.addprocs_pod(2; namespace="my-namespace")

An interface more in line with Distributed.jl would be:

pids = addprocs(K8sClusterManager(2; namespace="my-namespace"))
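
For reference, supporting addprocs(K8sClusterManager(...)) means following the standard Distributed.jl ClusterManager interface: a ClusterManager subtype plus launch and manage methods. The sketch below is illustrative only; the field names and empty method bodies are placeholders, not the package's actual definition.

using Distributed

# Illustrative only: Distributed.jl dispatches addprocs through a
# ClusterManager subtype that implements `launch` and `manage`.
struct K8sClusterManager <: Distributed.ClusterManager
    np::Int
    namespace::String
end

K8sClusterManager(np::Integer; namespace::String="default") =
    K8sClusterManager(np, namespace)

function Distributed.launch(manager::K8sClusterManager, params::Dict,
                            launched::Array, c::Condition)
    # Spawn `manager.np` worker pods and push a WorkerConfig for each
    # onto `launched`, notifying `c` as workers become available.
end

function Distributed.manage(manager::K8sClusterManager, id::Integer,
                            config::WorkerConfig, op::Symbol)
    # Handle :register, :deregister, :interrupt, and :finalize events.
end

With those two methods defined, addprocs(K8sClusterManager(2; namespace="my-namespace")) would behave like any other cluster manager.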

`error: couldn't find any field with path "{.metadata.labels.worker-id}"`

Occurred in CI for a README-only change: #59

[ Info: Waiting for test-multi-addprocs job. This could take up to 4 minutes...
error: couldn't find any field with path "{.metadata.labels.worker-id}" in the list of objects
test-multi-addprocs: Error During Test at /home/runner/work/K8sClusterManagers.jl/K8sClusterManagers.jl/test/cluster.jl:234
  Got exception outside of a @test
  failed process: Process(`/home/runner/.julia/artifacts/e549ab3a763d3b31e726aa6336c6dbb75ee90a05/bin/kubectl get pods -l manager=test-multi-addprocs-wlpv5 '-o=jsonpath={range .items[*]}{.metadata.name}{"\n"}{end}' '--sort-by={.metadata.labels.worker-id}'`, ProcessExited(1)) [1]
  
  Stacktrace:
    [1] pipeline_error
      @ ./process.jl:525 [inlined]
    [2] read(cmd::Cmd)
      @ Base ./process.jl:412
    [3] read(cmd::Cmd, #unused#::Type{String})
      @ Base ./process.jl:421
    [4] readchomp
      @ ./io.jl:923 [inlined]
    [5] (::var"#19#21"{Cmd, String, String})(exe::String)
      @ Main ~/work/K8sClusterManagers.jl/K8sClusterManagers.jl/test/utils.jl:49
    [6] (::JLLWrappers.var"#2#3"{var"#19#21"{Cmd, String, String}, String})()
      @ JLLWrappers ~/.julia/packages/JLLWrappers/bkwIo/src/runtime.jl:49
    [7] withenv(::JLLWrappers.var"#2#3"{var"#19#21"{Cmd, String, String}, String}, ::Pair{String, String}, ::Vararg{Pair{String, String}, N} where N)
      @ Base ./env.jl:161
    [8] withenv_executable_wrapper(f::Function, executable_path::String, PATH::String, LIBPATH::String, adjust_PATH::Bool, adjust_LIBPATH::Bool)
      @ JLLWrappers ~/.julia/packages/JLLWrappers/bkwIo/src/runtime.jl:48
    [9] #invokelatest#2
      @ ./essentials.jl:708 [inlined]
   [10] invokelatest
      @ ./essentials.jl:706 [inlined]
   [11] #kubectl#7
      @ ~/.julia/packages/JLLWrappers/bkwIo/src/products/executable_generators.jl:7 [inlined]
   [12] kubectl
      @ ~/.julia/packages/JLLWrappers/bkwIo/src/products/executable_generators.jl:7 [inlined]
   [13] pod_names(labels::Pair{String, SubString{String}}; sort_by::String)
      @ Main ~/work/K8sClusterManagers.jl/K8sClusterManagers.jl/test/utils.jl:48
   [14] macro expansion
      @ ~/work/K8sClusterManagers.jl/K8sClusterManagers.jl/test/cluster.jl:265 [inlined]
   [15] macro expansion
      @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
   [16] top-level scope
      @ ~/work/K8sClusterManagers.jl/K8sClusterManagers.jl/test/cluster.jl:235
   [17] include(fname::String)
      @ Base.MainInclude ./client.jl:444
   [18] macro expansion
      @ ~/work/K8sClusterManagers.jl/K8sClusterManagers.jl/test/runtests.jl:27 [inlined]
   [19] macro expansion
      @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
   [20] macro expansion
      @ ~/work/K8sClusterManagers.jl/K8sClusterManagers.jl/test/runtests.jl:27 [inlined]
   [21] macro expansion
      @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
   [22] top-level scope
      @ ~/work/K8sClusterManagers.jl/K8sClusterManagers.jl/test/runtests.jl:20
   [23] include(fname::String)
      @ Base.MainInclude ./client.jl:444
   [24] top-level scope
      @ none:6
   [25] eval
      @ ./boot.jl:360 [inlined]
   [26] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:261
   [27] _start()
      @ Base ./client.jl:485

https://github.com/beacon-biosignals/K8sClusterManagers.jl/pull/59/checks?check_run_id=2470070043

quickly return WorkerPool, add workers later

Getting many requested pods can trigger scale-up which takes time.

Currently this is handled with a timeout: any requested pods that have not come up and connected by the timeout are dropped, and launch returns with however many workers are available at that point. This can be awkward.

An alternative way to deal with spin-up slowness is to return a WorkerPool quickly, maybe as soon as there is one worker connected, and continue adding workers to that pool after returning.

To be practical, this method should have a worker initialization hook, so that workers only join the WorkerPool after evaluating some quoted code in Main, typically loading packages and other definitions.
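
A conceptual sketch of the shape this could take (addprocs is called one worker at a time here purely for simplicity; a real implementation would launch pods concurrently):

using Distributed, K8sClusterManagers

# Conceptual sketch only: return an empty WorkerPool immediately and add each
# worker to it after evaluating `init` (e.g. :(using MyPackage)) in its Main.
function async_worker_pool(n::Integer; init::Expr=:(nothing), kwargs...)
    pool = WorkerPool()
    @async for _ in 1:n
        pid = only(addprocs(K8sClusterManager(1; kwargs...)))
        Distributed.remotecall_eval(Main, pid, init)  # worker initialization hook
        push!(pool, pid)  # the worker joins the pool only after init succeeds
    end
    return pool  # returned right away; workers trickle in afterwards
end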

`IOError: connect: connection refused (ECONNREFUSED)`

[ Info: test-multi-addprocs-x7wr9-worker-c55hc is up
[ Info: test-multi-addprocs-x7wr9-worker-sldrh is up
ERROR: TaskFailedException

    nested task error: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1082
     [2] worker_from_id
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1079 [inlined]
     [3] #remote_do#154
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [4] remote_do
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [5] kill
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:675 [inlined]
     [6] create_worker(manager::K8sClusterManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:593
     [7] setup_launched_worker(manager::K8sClusterManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [8] (::Distributed.var"#41#44"{K8sClusterManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:406

    caused by: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:532
     [2] connect
       @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:567 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:639
     [4] connect(manager::K8sClusterManager, pid::Int64, config::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:566
     [5] create_worker(manager::K8sClusterManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:589
     [6] setup_launched_worker(manager::K8sClusterManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [7] (::Distributed.var"#41#44"{K8sClusterManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:406
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:364
 [2] macro expansion
   @ ./task.jl:383 [inlined]
 [3] addprocs_locked(manager::K8sClusterManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:480
 [4] addprocs_locked
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:451 [inlined]
 [5] addprocs(manager::K8sClusterManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:444
 [6] addprocs(manager::K8sClusterManager)
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:438
 [7] top-level scope
   @ none:11

I originally noticed this error as part of another PR (#44 (comment)) but have also observed this failure when removing this sleep call.

I believe what is happening is that we wait for the pod to be Running but attempt to connect to the pod before the worker has actually started listening. Since the original failure shown above occurred before this sleep(2) call was removed, there is probably some variability in how long it takes the Julia worker to start listening.
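
One way to make the connection step tolerant of that startup gap would be a bounded retry loop around the TCP connect. A sketch, not the package's current code:

using Sockets

# Sketch: retry the TCP connection instead of assuming the worker is listening
# as soon as its pod reports Running.
function connect_with_retry(host::AbstractString, port::Integer; retries::Int=10, delay::Real=1.0)
    for attempt in 1:retries
        try
            return Sockets.connect(host, port)
        catch e
            e isa Base.IOError || rethrow()
            attempt == retries && rethrow()
            sleep(delay)  # give `--worker` time to bind the port
        end
    end
end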

Run cluster tests in parallel

The cluster tests are by far the slowest part of our test suite. As these tests run on k8s, they give us an opportunity to run these testsets in parallel. Specifically, we can start all of the jobs/pods at the same time and, if the cluster they run on can handle them all, they will run concurrently. The test suites can then wait for their respective jobs/pods to finish and evaluate them normally.
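
A rough sketch of the structure (create_job, wait_for_job, and job_succeeded are assumed helpers here, not existing test utilities):

using Test

# Hypothetical layout: create every k8s job up front so they run concurrently,
# then have each testset wait on and evaluate its own job.
job_names = ["test-success", "test-multi-addprocs"]
jobs = Dict(name => create_job(name) for name in job_names)

@testset "$name" for name in job_names
    wait_for_job(jobs[name])          # block until this job's pods complete
    @test job_succeeded(jobs[name])
end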

[Feature Request] Support for running inside a namespace

Hello,

I would like to use K8sClusterManagers.jl inside a predefined namespace to prevent accidental (or deliberate) damage to other infrastructure inside Kubernetes.

Currently this is not possible. It fails as soon as get_pod is run - it asks for the pod name globally, which is not allowed by its RBAC role.
https://github.com/beacon-biosignals/K8sClusterManagers.jl/blob/main/src/pod.jl#L29

Could this be added to the project or is it out of scope?
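
For illustration, a namespace-scoped lookup might look like the following sketch (the helper name and signature are hypothetical; kubectl() is assumed to interpolate as a runnable command, as in the package's own code):

using kubectl_jll: kubectl

# Sketch: restrict the pod lookup to a namespace so a namespaced RBAC role is
# sufficient; returns the pod manifest as a JSON string.
function get_pod_in_namespace(name::AbstractString, namespace::AbstractString)
    return read(`$(kubectl()) get pod $name --namespace $namespace -o json`, String)
end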

Delete completed worker pods on deregister

We currently launch pods for workers, and after a pod has succeeded or failed it remains visible in the pod list from kubectl get pods. For pods that fail we should probably refrain from deleting them to assist with postmortem debugging, but for succeeded pods we may want to clean them up automatically.

Pods don't support a TTL but completed pods (succeeded or failed) are eventually garbage collected.
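
A deregister-time cleanup could be as simple as the following sketch (helper name hypothetical; only succeeded pods are removed, and kubectl() is assumed to behave as in the package's own code):

using kubectl_jll: kubectl

# Sketch: delete a worker pod on deregister only if it completed successfully,
# keeping failed pods around for postmortem debugging.
function cleanup_worker_pod(pod_name::AbstractString)
    phase = readchomp(`$(kubectl()) get pod $pod_name -o jsonpath={.status.phase}`)
    if phase == "Succeeded"
        run(`$(kubectl()) delete pod $pod_name --wait=false`)
    end
    return phase
end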

Command from README is deprecated

❯ kubectl run -it example-manager-pod --image julia:1

kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
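
Per the deprecation warning, a direct replacement (assuming the README still intends to start a single interactive pod) would be:

❯ kubectl run -it --generator=run-pod/v1 example-manager-pod --image julia:1

On newer kubectl releases the generators have been removed entirely and kubectl run creates a bare pod by default, so the original command works there without the warning.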

Code coverage for cluster tests

We're missing out on some coverage of code that is exercised by the tests because those code paths run on k8s pods. We should be able to include the coverage from these tests if we do the following on CI:

As --code-coverage=user generates code coverage files right beside the source code, we need to be executing our pod code inside the PV. Doing this ensures the code coverage files are written to the PV without having to manually move files around. The cleanest way I can think of to do this is to use minikube mount to mount the CI host directory for K8sClusterManagers.jl onto the node(s) and then use a PV hostPath to mount that directory to the package location inside of the pod. This approach should mean that code coverage files generated on the pod are written directly back to the CI host.

However, there are some potential issues to work through:

  • Using PV hostPath requires an access mode of ReadWriteOnce which eliminates any parallelism we may want to do on the CI (#53)
  • Using minikube mount with multiple nodes probably isn't supported. We can probably use one node on CI if we're careful to use partial CPU resources
  • The PIDs used on the pods are quite likely to be identical across tests, which means the coverage files may end up having conflicting names.

Improve error messaging when using an incorrect namespace

When specifying an incorrect namespace you get the following error:

julia> pids = K8sClusterManagers.addprocs_pod(2; namespace="ondawerks")
[ Info: unsupported crd.k8s.amazonaws.com/v1alpha1
[ Info: unsupported vpcresources.k8s.aws/v1beta1
ERROR: Swagger.ApiException(403, "Forbidden", HTTP.Messages.Response:
"""
HTTP/1.1 403 Forbidden
Audit-Id: dd0173b7-7fe9-4ffc-811f-0eca71fdc943
Cache-Control: no-cache, private
Content-Length: 392
Content-Type: application/json
Date: Fri, 19 Mar 2021 18:04:26 GMT
X-Content-Type-Options: nosniff
Connection: close

{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"driver-2021-03-19--17-58-47-kpqnt\" is forbidden: User \"system:serviceaccount:project-ondawerks:ondawerks-service-account\" cannot get resource \"pods\" in API group \"\" in the namespace \"ondawerks\"","reason":"Forbidden","details":{"name":"driver-2021-03-19--17-58-47-kpqnt","kind":"pods"},"code":403}
""")
Stacktrace:
 [1] exec(::Swagger.Ctx) at /root/.julia/packages/Swagger/4eKnR/src/client.jl:213
 [2] readCoreV1NamespacedPod(::Kuber.Kubernetes.CoreV1Api, ::String, ::String; pretty::Nothing, exact::Nothing, __export__::Nothing, _mediaType::Nothing) at /root/.julia/packages/Kuber/UfXri/src/api/api_CoreV1Api.jl:3912
 [3] #readNamespacedPod#735 at /root/.julia/packages/Kuber/UfXri/src/apialiases.jl:2026 [inlined]
 [4] readNamespacedPod(::Kuber.Kubernetes.CoreV1Api, ::String, ::String) at /root/.julia/packages/Kuber/UfXri/src/apialiases.jl:2026
 [5] macro expansion at /root/.julia/packages/Kuber/UfXri/src/simpleapi.jl:79 [inlined]
 [6] macro expansion at /root/.julia/packages/Retry/vS1bg/src/repeat_try.jl:192 [inlined]
 [7] get(::Kuber.KuberContext, ::Symbol, ::String, ::Nothing; max_tries::Int64, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/Kuber/UfXri/src/simpleapi.jl:78
 [8] get at /root/.julia/packages/Kuber/UfXri/src/simpleapi.jl:62 [inlined] (repeats 2 times)
 [9] self_pod at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:19 [inlined]
 [10] default_pod(::Kuber.KuberContext, ::Int64, ::Cmd, ::String; image::Nothing, memory::String, cpu::String, serviceAccountName::Nothing, base_obj::Kuber.Kubernetes.IoK8sApiCoreV1Pod, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:35
 [11] (::K8sClusterManagers.var"#3#4"{typeof(identity),String,Cmd,Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}},Kuber.KuberContext})(::Int64) at ./none:0
 [12] iterate at ./generator.jl:47 [inlined]
 [13] _all at ./reduce.jl:827 [inlined]
 [14] all at ./reduce.jl:823 [inlined]
 [15] Dict(::Base.Generator{UnitRange{Int64},K8sClusterManagers.var"#3#4"{typeof(identity),String,Cmd,Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}},Kuber.KuberContext}}) at ./dict.jl:130
 [16] #default_pods_and_context#2 at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:67 [inlined]
 [17] K8sNativeManager(::UnitRange{Int64}, ::String, ::Cmd; configure::Function, namespace::String, retry_seconds::Int64, kwargs::Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}}) at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:82
 [18] addprocs_pod(::Int64; driver_name::String, configure::Function, namespace::String, image::Nothing, serviceAccountName::Nothing, memory::String, cpu::String, retry_seconds::Int64, exename::Cmd, exeflags::Cmd, params::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:193
 [19] top-level scope at REPL[4]:1
caused by [exception 3]
UndefVarError: readPod not defined
Stacktrace:
 [1] top-level scope
 [2] eval at ./boot.jl:331 [inlined]
 [3] eval at /root/.julia/packages/Kuber/UfXri/src/Kuber.jl:1 [inlined]
 [4] get(::Kuber.KuberContext, ::Symbol, ::String, ::Nothing; max_tries::Int64, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/Kuber/UfXri/src/simpleapi.jl:66
 [5] get at /root/.julia/packages/Kuber/UfXri/src/simpleapi.jl:62 [inlined] (repeats 2 times)
 [6] self_pod at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:19 [inlined]
 [7] default_pod(::Kuber.KuberContext, ::Int64, ::Cmd, ::String; image::Nothing, memory::String, cpu::String, serviceAccountName::Nothing, base_obj::Kuber.Kubernetes.IoK8sApiCoreV1Pod, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:35
 [8] (::K8sClusterManagers.var"#3#4"{typeof(identity),String,Cmd,Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}},Kuber.KuberContext})(::Int64) at ./none:0
 [9] iterate at ./generator.jl:47 [inlined]
 [10] _all at ./reduce.jl:827 [inlined]
 [11] all at ./reduce.jl:823 [inlined]
 [12] Dict(::Base.Generator{UnitRange{Int64},K8sClusterManagers.var"#3#4"{typeof(identity),String,Cmd,Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}},Kuber.KuberContext}}) at ./dict.jl:130
 [13] #default_pods_and_context#2 at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:67 [inlined]
 [14] K8sNativeManager(::UnitRange{Int64}, ::String, ::Cmd; configure::Function, namespace::String, retry_seconds::Int64, kwargs::Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}}) at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:82
 [15] addprocs_pod(::Int64; driver_name::String, configure::Function, namespace::String, image::Nothing, serviceAccountName::Nothing, memory::String, cpu::String, retry_seconds::Int64, exename::Cmd, exeflags::Cmd, params::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:193
 [16] top-level scope at REPL[4]:1
caused by [exception 2]
Swagger.ApiException(403, "Forbidden", HTTP.Messages.Response:
"""
HTTP/1.1 403 Forbidden
Audit-Id: 2f0534e1-0734-4736-b5ef-2236b7f7c670
Cache-Control: no-cache, private
Content-Length: 392
Content-Type: application/json
Date: Fri, 19 Mar 2021 18:04:26 GMT
X-Content-Type-Options: nosniff
Connection: close

{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"driver-2021-03-19--17-58-47-kpqnt\" is forbidden: User \"system:serviceaccount:project-ondawerks:ondawerks-service-account\" cannot get resource \"pods\" in API group \"\" in the namespace \"ondawerks\"","reason":"Forbidden","details":{"name":"driver-2021-03-19--17-58-47-kpqnt","kind":"pods"},"code":403}
""")
Stacktrace:
 [1] exec(::Swagger.Ctx) at /root/.julia/packages/Swagger/4eKnR/src/client.jl:213
 [2] readCoreV1NamespacedPod(::Kuber.Kubernetes.CoreV1Api, ::String, ::String; pretty::Nothing, exact::Nothing, __export__::Nothing, _mediaType::Nothing) at /root/.julia/packages/Kuber/UfXri/src/api/api_CoreV1Api.jl:3912
 [3] #readNamespacedPod#735 at /root/.julia/packages/Kuber/UfXri/src/apialiases.jl:2026 [inlined]
 [4] readNamespacedPod(::Kuber.Kubernetes.CoreV1Api, ::String, ::String) at /root/.julia/packages/Kuber/UfXri/src/apialiases.jl:2026
 [5] macro expansion at /root/.julia/packages/Kuber/UfXri/src/simpleapi.jl:79 [inlined]
 [6] macro expansion at /root/.julia/packages/Retry/vS1bg/src/repeat_try.jl:192 [inlined]
 [7] get(::Kuber.KuberContext, ::Symbol, ::String, ::Nothing; max_tries::Int64, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/Kuber/UfXri/src/simpleapi.jl:78
 [8] get at /root/.julia/packages/Kuber/UfXri/src/simpleapi.jl:62 [inlined] (repeats 2 times)
 [9] self_pod at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:19 [inlined]
 [10] default_pod(::Kuber.KuberContext, ::Int64, ::Cmd, ::String; image::Nothing, memory::String, cpu::String, serviceAccountName::Nothing, base_obj::Kuber.Kubernetes.IoK8sApiCoreV1Pod, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:35
 [11] (::K8sClusterManagers.var"#3#4"{typeof(identity),String,Cmd,Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}},Kuber.KuberContext})(::Int64) at ./none:0
 [12] iterate at ./generator.jl:47 [inlined]
 [13] grow_to!(::Dict{Any,Any}, ::Base.Generator{UnitRange{Int64},K8sClusterManagers.var"#3#4"{typeof(identity),String,Cmd,Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}},Kuber.KuberContext}}) at ./dict.jl:139
 [14] dict_with_eltype at ./abstractdict.jl:540 [inlined]
 [15] Dict(::Base.Generator{UnitRange{Int64},K8sClusterManagers.var"#3#4"{typeof(identity),String,Cmd,Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}},Kuber.KuberContext}}) at ./dict.jl:128
 [16] #default_pods_and_context#2 at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:67 [inlined]
 [17] K8sNativeManager(::UnitRange{Int64}, ::String, ::Cmd; configure::Function, namespace::String, retry_seconds::Int64, kwargs::Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}}) at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:82
 [18] addprocs_pod(::Int64; driver_name::String, configure::Function, namespace::String, image::Nothing, serviceAccountName::Nothing, memory::String, cpu::String, retry_seconds::Int64, exename::Cmd, exeflags::Cmd, params::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:193
 [19] top-level scope at REPL[4]:1
caused by [exception 1]
UndefVarError: readPod not defined
Stacktrace:
 [1] top-level scope
 [2] eval at ./boot.jl:331 [inlined]
 [3] eval at /root/.julia/packages/Kuber/UfXri/src/Kuber.jl:1 [inlined]
 [4] get(::Kuber.KuberContext, ::Symbol, ::String, ::Nothing; max_tries::Int64, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/Kuber/UfXri/src/simpleapi.jl:66
 [5] get at /root/.julia/packages/Kuber/UfXri/src/simpleapi.jl:62 [inlined] (repeats 2 times)
 [6] self_pod at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:19 [inlined]
 [7] default_pod(::Kuber.KuberContext, ::Int64, ::Cmd, ::String; image::Nothing, memory::String, cpu::String, serviceAccountName::Nothing, base_obj::Kuber.Kubernetes.IoK8sApiCoreV1Pod, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:35
 [8] (::K8sClusterManagers.var"#3#4"{typeof(identity),String,Cmd,Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}},Kuber.KuberContext})(::Int64) at ./none:0
 [9] iterate at ./generator.jl:47 [inlined]
 [10] grow_to!(::Dict{Any,Any}, ::Base.Generator{UnitRange{Int64},K8sClusterManagers.var"#3#4"{typeof(identity),String,Cmd,Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}},Kuber.KuberContext}}) at ./dict.jl:139
 [11] dict_with_eltype at ./abstractdict.jl:540 [inlined]
 [12] Dict(::Base.Generator{UnitRange{Int64},K8sClusterManagers.var"#3#4"{typeof(identity),String,Cmd,Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}},Kuber.KuberContext}}) at ./dict.jl:128
 [13] #default_pods_and_context#2 at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:67 [inlined]
 [14] K8sNativeManager(::UnitRange{Int64}, ::String, ::Cmd; configure::Function, namespace::String, retry_seconds::Int64, kwargs::Base.Iterators.Pairs{Symbol,Any,NTuple{4,Symbol},NamedTuple{(:image, :memory, :cpu, :serviceAccountName),Tuple{Nothing,String,String,Nothing}}}) at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:82
 [15] addprocs_pod(::Int64; driver_name::String, configure::Function, namespace::String, image::Nothing, serviceAccountName::Nothing, memory::String, cpu::String, retry_seconds::Int64, exename::Cmd, exeflags::Cmd, params::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/K8sClusterManagers/eAv58/src/native_driver.jl:193
 [16] top-level scope at REPL[4]:1

Setting the namespace to "project-ondawerks" works properly.
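
One way to improve the message would be to catch the 403 around the pod lookup and explain the likely namespace/RBAC mismatch. A sketch only; the exception's status field and the wrapped self_pod call are assumed from the trace above:

# Sketch: surface a clearer error when reading the driver pod is forbidden.
function self_pod_checked(ctx, namespace::AbstractString)
    try
        return self_pod(ctx)
    catch e
        if e isa Swagger.ApiException && e.status == 403
            error("Reading the driver pod in namespace \"$namespace\" is forbidden. " *
                  "Check that the namespace is correct and that the service account's " *
                  "RBAC role allows getting pods there.")
        end
        rethrow()
    end
end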

pattern to `rmprocs` when there's no more work to do

This isn't strictly a K8sClusterManagers.jl issue, but @omus pointed me here :).

I was running hyperparameter optimization on a model using @phyperopt from Hyperopt.jl with pmap=Parallelism.robust_pmap from Parallelism.jl. I would spin up the desired number of workers with addprocs, then essentially call pmap via these abstractions, and then that's it. When the pmap is done, the manager writes out a summary and exits, and all the processors are released.

I wanted to train 20 models this way quickly, so I did this with 20 workers and left them to train. However, some finished much faster than others, and those processors were left idling. Since this is via k8s, if we killed them, we could have in-scaled and saved lots of resources.

It would be great to have something like pmap that could automatically remove processors when they were no longer needed.
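
For illustration, here is a conceptual sketch of that pattern (results come back grouped per worker rather than in input order, and a production version would need more care around errors):

using Distributed

# Conceptual sketch: a pmap-like loop that releases each worker as soon as
# there is no more work for it, so idle k8s pods can be scaled in early.
function shrinking_pmap(f, xs)
    jobs = RemoteChannel(() -> Channel{Any}(length(xs)))
    foreach(x -> put!(jobs, x), xs)
    close(jobs)  # take! throws once the channel is drained
    chunks = asyncmap(workers()) do pid
        out = Any[]
        while true
            x = try
                take!(jobs)
            catch
                break  # no work left
            end
            push!(out, remotecall_fetch(f, pid, x))
        end
        rmprocs(pid)  # this worker is idle from here on; release its pod
        out
    end
    return reduce(vcat, chunks; init=Any[])
end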

FR: Ergonomic way to inherit driver labels on workers

Or otherwise easily apply cluster-autoscaler.kubernetes.io/safe-to-evict: "false", which is likely desirable whenever running a single computational job/script (as opposed to a service that should be downscaled when idle etc).
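
In the meantime, something along these lines may work via the configure hook, assuming it receives the worker pod manifest as a nested Dict-like object it can mutate (a sketch, not a tested recipe):

using Distributed, K8sClusterManagers

# Sketch: add the cluster-autoscaler annotation to every worker pod manifest.
function no_evict!(manifest)
    metadata = get!(manifest, "metadata", Dict{String,Any}())
    annotations = get!(metadata, "annotations", Dict{String,Any}())
    annotations["cluster-autoscaler.kubernetes.io/safe-to-evict"] = "false"
    return manifest
end

addprocs(K8sClusterManager(4; configure=no_evict!))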

CI Cluster Test failed

Not sure why this CI cluster test failed:

[ Info: Waiting for test-multi-addprocs job. This could take up to 4 minutes...
 Error from server (NotFound): pods "test-multi-addprocs-st7tc" not found
test-multi-addprocs: Error During Test at /home/runner/work/K8sClusterManagers.jl/K8sClusterManagers.jl/test/cluster.jl:279
  Test threw exception
  Expression: pod_phase(manager_pod) == "Succeeded"

https://github.com/beacon-biosignals/K8sClusterManagers.jl/runs/5027583871?check_suite_focus=true#step:9:140

Unable to determine pod name

@kolia reported this issue with [email protected]:

julia> addprocs(K8sClusterManager(n_workers; pending_timeout=180, memory="1Gi"))
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-z4jjs is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-fvt5b is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-jt2dv is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-gwtsp is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-pv5dw is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-dlzxx is up
ERROR: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:322 [inlined]
 [2] addprocs_locked(manager::K8sClusterManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:497
 [3] addprocs_locked
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:451 [inlined]
 [4] addprocs(manager::K8sClusterManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:444
 [5] addprocs(manager::K8sClusterManager)
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:438
 [6] top-level scope
   @ REPL[16]:1
    nested task error: TaskFailedException
        nested task error: Unable to determine the pod name from: ""
        Stacktrace:
         [1] error(s::String)
           @ Base ./error.jl:33
         [2] create_pod(manifest::DataStructures.DefaultOrderedDict{String, Any, typeof(K8sClusterManagers.rdict)})
           @ K8sClusterManagers ~/.julia/packages/K8sClusterManagers/vRyNt/src/pod.jl:68
         [3] macro expansion
           @ ~/.julia/packages/K8sClusterManagers/vRyNt/src/native_driver.jl:107 [inlined]
         [4] (::K8sClusterManagers.var"#29#31"{K8sClusterManager, Vector{WorkerConfig}, Condition})()
           @ K8sClusterManagers ./task.jl:411
    ...and 25 more exceptions.
    Stacktrace:
     [1] sync_end(c::Channel{Any})
       @ Base ./task.jl:369
     [2] macro expansion
       @ ./task.jl:388 [inlined]
     [3] launch(manager::K8sClusterManager, params::Dict{Symbol, Any}, launched::Vector{WorkerConfig}, c::Condition)
       @ K8sClusterManagers ~/.julia/packages/K8sClusterManagers/vRyNt/src/native_driver.jl:105
     [4] (::Distributed.var"#39#42"{K8sClusterManager, Condition, Vector{WorkerConfig}, Dict{Symbol, Any}})()
       @ Distributed ./task.jl:411

Pending pods can make `launch` stall

Sometimes a few pods that are still Pending once retry_seconds is up make the cleanup stall for some reason; this needs investigation.

In the meantime, when that happens, it's safe to Ctrl-C interrupt and use whatever procs() have been successfully provisioned, then clean up the pending stragglers with kubectl delete.

document caveats/warts

@kolia mentioned there are some undocumented caveats here, e.g. the behavior when more resources are requested than are available.

Test worker termination during launch

Currently the cluster tests don't include having a worker terminate during the launch procedure. I expect the current behaviour would cause the launch to be terminated due to an uncaught exception being raised (either while waiting for the pod to be running or while the manager attempts to connect).

have tests! + test coverage > 50%

k0s is a single-executable Kubernetes distribution that will easily launch a working cluster locally.

We could:

  • make a k0s_jll with BinaryBuilder
  • add it as a test dependency and run k0s in the background
  • depend on kubectl_jll and run kubectl proxy in the background
  • write tests that can run locally

Delete cluster test pods when successful

Currently the cluster tests do not clean up the jobs/pods they create. This is definitely useful for investigating failures via kubectl, but when tests are successful they just leave a mess behind. We should set up the tests to automatically delete the resources they created, but only if the tests were successful.

`env` kwarg in `addprocs` not supported

The addprocs(machines, ...) documentation describes a keyword argument env, which accepts an iterable of pairs and sets the corresponding environment variables on the remote machine(s). This is possible to do manually with configure, but it might be a nice convenience to support the keyword here as well.
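
For now, a manual version via configure might look like this sketch (assuming configure receives the worker pod manifest as a nested Dict-like object; the environment variable shown is arbitrary):

using Distributed, K8sClusterManagers

# Sketch: append env entries to the worker container in the pod manifest.
function add_env!(manifest)
    container = manifest["spec"]["containers"][1]
    env = get!(container, "env", Any[])
    push!(env, Dict("name" => "JULIA_DEBUG", "value" => "Main"))
    return manifest
end

addprocs(K8sClusterManager(2; configure=add_env!))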

Support a `Revise` workflow on workers

devspace sync is an easy local install that establishes a two-way sync between local folders and folders in containers running in k8s. julia_pod uses this to sync the current Julia project folder with its equivalent in the running container, making it possible to update code locally and use Revise against a Julia REPL running in k8s.

This convenient development workflow breaks down when using K8sClusterManagers to spin up workers, because workers see separate file systems.

Because of this, it is currently recommended to always test Distributed code using local Distributed procs first before running on k8s.

If devspace sync is installed, we could set up one or more syncs between relevant local and worker folders for each worker as it is spun up, maybe defaulting to the current Julia project folder. This would make it possible to use Revise and develop code against the workers directly.

back Julia cluster instances via a pod controller rather than creating new pods individually?

IIUC, K8sClusterManagers currently uses kubectl to create individual pods whenever a worker is added.

I think it would make more sense from a K8s perspective to associate an actual pod controller with the "Julia cluster instance", and then update the underlying set of pods that back the Julia cluster instance via kubectl applying updates to that controller. This is philosophically more aligned with the way sets of pods are intended to be orchestrated in K8s land AFAIU, and would hopefully enable some configuration/resilience possibilities at the level of the whole cluster-instance and not just at the pod level (e.g. tuning pod scaling behaviors for the cluster instance, migrating workers from failed nodes, etc.)

IIRC the driver Julia process is already backed by a Job so maybe that'd be sufficient? I think StatefulSet is worth considering too. We should tour the built-in controllers and see which ones might make the most sense; especially w.r.t. being able to tell the controller to make additional pods available w/o interfering w/ existing ones

This would probably be massive overkill, but if none of the built-in K8s controllers are sufficient (which I guess is a possibility), you could even imagine a custom K8s operator implemented specifically for this purpose (JuliaClusterInstance)

Allow configuration option to not pass worker stdout/stderr back to manager

I.e. I think it would be great if the manager could store a Bool for whether or not we want to run these lines:

# Redirect any stdout/stderr from the worker to be displayed on the manager.
# Note: `start_worker` (via `--worker`) automatically redirects stderr to
# stdout.
p = open(detach(`$(kubectl()) logs -f pod/$pod_name`), "r+")
config = WorkerConfig()
config.io = p.out

It probably should default to true to match Base and to help when getting started, although I suspect many users will want to turn it off once their projects have reached some level of maturity.

Why would someone want to turn it off? Folks might already have logging set up on pods by default (like we do at Beacon), in which case having the manager's logs filled with the workers' output as well as its own is redundant and overly verbose, since you can easily get the workers' logs by looking at their pod output. Additionally, the way Base uses this IO object is to print the output directly without using the logging system, so there's no opportunity to customize or redirect it via loggers.
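
Concretely, the idea is just to gate those lines behind a flag. A sketch (forward_logs is not an existing option, and kubectl() is assumed to behave as in the snippet above):

using Distributed
using kubectl_jll: kubectl

# Sketch: only attach to the worker pod's logs when forwarding is requested.
function worker_config_for(pod_name::AbstractString; forward_logs::Bool=true)
    config = WorkerConfig()
    if forward_logs
        p = open(detach(`$(kubectl()) logs -f pod/$pod_name`), "r+")
        config.io = p.out
    end
    return config
end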

Allow specifying minimum number of workers

As of the fix in #69, if workers are not spawned by the pending_timeout then the cluster manager proceeds with the workers currently available. In some scenarios this may be unacceptable, so having a way to specify a minimum number of workers may be convenient.
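
Until such an option exists, a usage-level check is a possible stopgap (the parameter values below are illustrative):

using Distributed, K8sClusterManagers

# Sketch: request 16 workers but abort if fewer than 8 actually connect.
pids = addprocs(K8sClusterManager(16; pending_timeout=180))
length(pids) >= 8 || error("Only $(length(pids)) of 16 workers connected; aborting.")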

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Running `addprocs` more than once fails

If you run:

addprocs(K8sClusterManager(1))
addprocs(K8sClusterManager(1))

The second addprocs command will always fail:

[ Info: unsupported events.k8s.io/v1
[ Info: unsupported certificates.k8s.io/v1
[ Info: unsupported node.k8s.io/v1
[ Info: unsupported flowcontrol.apiserver.k8s.io/v1beta1
[ Info: test-success-v7zw4-worker-9001 is up
[ Info: unsupported events.k8s.io/v1
[ Info: unsupported certificates.k8s.io/v1
[ Info: unsupported node.k8s.io/v1
[ Info: unsupported flowcontrol.apiserver.k8s.io/v1beta1
[ Info: test-success-v7zw4-worker-9001 is up
ERROR: TaskFailedException
    nested task error: IOError: connect: address not available (EADDRNOTAVAIL)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1082
     [2] worker_from_id
       @ /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1079 [inlined]
     [3] #remote_do#154
       @ /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [4] remote_do
       @ /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/remotecall.jl:486 [inlined]
     [5] kill
       @ /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:675 [inlined]
     [6] create_worker(manager::K8sClusterManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:593
     [7] setup_launched_worker(manager::K8sClusterManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [8] (::Distributed.var"#41#44"{K8sClusterManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:406
    caused by: IOError: connect: address not available (EADDRNOTAVAIL)
    Stacktrace:
     [1] uv_error
       @ ./libuv.jl:97 [inlined]
     [2] connect!(sock::Sockets.TCPSocket, host::Sockets.IPv4, port::UInt16)
       @ Sockets /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:508
     [3] connect
       @ /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Sockets/src/Sockets.jl:566 [inlined]
     [4] connect_to_worker(host::String, port::Int64)
       @ Distributed /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:639
     [5] connect(manager::K8sClusterManager, pid::Int64, config::WorkerConfig)
       @ Distributed /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/managers.jl:566
     [6] create_worker(manager::K8sClusterManager, wconfig::WorkerConfig)
       @ Distributed /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:589
     [7] setup_launched_worker(manager::K8sClusterManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:534
     [8] (::Distributed.var"#41#44"{K8sClusterManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:406
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:364
 [2] macro expansion
   @ ./task.jl:383 [inlined]
 [3] addprocs_locked(manager::K8sClusterManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:480
 [4] addprocs_locked
   @ /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:451 [inlined]
 [5] addprocs(manager::K8sClusterManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:444
 [6] addprocs(manager::K8sClusterManager)
   @ Distributed /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:438
 [7] top-level scope
   @ none:9

Note in particular the second info message which has the same pod name as the first:

[ Info: test-success-v7zw4-worker-9001 is up

What is happening is that the second addprocs call is attempting to create a resource that already exists. Adding some logging to the try/catch that creates the worker pod, the following error message can be seen repeatedly:

┌ Error: Swagger.ApiException(409, "Conflict", HTTP.Messages.Response:
│ """
│ HTTP/1.1 409 Conflict
│ Cache-Control: no-cache, private
│ Content-Length: 238
│ Content-Type: application/json
│ Date: Fri, 23 Apr 2021 13:37:38 GMT
│ X-Kubernetes-Pf-Flowschema-Uid: ff1ab154-3b2b-45a1-82d5-041f25047af3
│ X-Kubernetes-Pf-Prioritylevel-Uid: 7005aead-8f3e-47db-a16b-65bab3d416cb
│ Connection: close
│
│ {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"test-success-lwr4t-worker-9001\" already exists","reason":"AlreadyExists","details":{"name":"test-success-lwr4t-worker-9001","kind":"pods"},"code":409}
│ """)
└ @ K8sClusterManagers ~/.julia/dev/K8sClusterManagers/src/native_driver.jl:136

Unfortunately, this error message cannot normally be seen, and we shouldn't be attempting to re-create resources that already exist. Ideally we should always generate unique pod names to avoid this situation.
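
For example, incorporating a per-launch random component would avoid the collision. A sketch; driver_name and worker_id stand in for values the manager already tracks:

using Random

# Sketch: add a random per-addprocs suffix so repeated calls cannot reuse an
# existing pod name such as test-success-v7zw4-worker-9001.
driver_name = "test-success-v7zw4"
worker_id = 9001
launch_id = randstring('a':'z', 5)
pod_name = "$driver_name-worker-$launch_id-$worker_id"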
