
admiralty's Issues

Controller manager crashes if pod is annotated but namespace isn't labeled

If a pod is annotated with multicluster.admiralty.io/elect="" in a namespace not labeled with multicluster-scheduler=enabled, it gets scheduled to a local node. The annotation is what some controllers look at to filter proxy pods. In this case, they erroneously consider the normal pod to be a proxy pod. Then, they use the node name as a key to find a client for a target cluster, which doesn't exist...

  • The pod's scheduler name should be used as a more reliable filter instead (see the sketch after this list).
  • When targets are removed, nodes should be removed, and pods evicted, so we don't encounter the same issue then.
  • We might want to rethink selectors, e.g., pod label only?
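
A minimal sketch of the first item, assuming proxy pods are created with a dedicated scheduler name (the name "admiralty-proxy" below is hypothetical):

package filter

import corev1 "k8s.io/api/core/v1"

// proxySchedulerName is hypothetical; use the name actually set on proxy
// pods at admission.
const proxySchedulerName = "admiralty-proxy"

// isProxyPod reports whether a pod is a proxy pod. Unlike the annotation
// check, this can't match a normal pod that merely carries the elect
// annotation in an unlabeled namespace, because such a pod keeps its
// original scheduler name.
func isProxyPod(pod *corev1.Pod) bool {
	return pod.Spec.SchedulerName == proxySchedulerName
}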

Uninstall hangs. Post-delete Helm hook crash-loops.

This is a regression introduced in v0.8.2. The patch isn't formatted correctly; it's missing quotes: https://github.com/admiraltyio/admiralty/blame/75cf62603a6f6417bbe2aca561bcaa89298aac7e/cmd/remove-finalizers/main.go#L34
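
The fix is presumably to quote the finalizer value so the patch is valid JSON; a sketch (not the actual commit):

package main

import "fmt"

func main() {
	// %q wraps the finalizer in double quotes; omitting the quotes is the
	// v0.8.2 regression.
	patch := fmt.Sprintf(
		`{"metadata":{"$deleteFromPrimitiveList/finalizers":[%q]}}`,
		"multicluster.admiralty.io/multiclusterForegroundDeletion")
	fmt.Println(patch)
}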

In the meantime, run this instead to remove finalizers (the hook's job, as a script):

echo 'metadata:
  $deleteFromPrimitiveList/finalizers:
    - multicluster.admiralty.io/multiclusterForegroundDeletion' > patch.yaml

for kind in service pod configmap secret
do
  kubectl get $kind --all-namespaces -o go-template='{{range .items}}{{ printf "kubectl patch %s %s -n %s -p \"$(cat patch.yaml)\"\n" .kind .metadata.name .metadata.namespace }}{{ end }}' | sh
done

GKE deployment

Hello,

I tried to deploy admiralty on two GKE clusters.

During virtual kubelet creation on the source cluster, I get the following error in the node description:

Warning FailedToCreateRoute 25m route_controller Could not create route 0d3a65bd-97b9-4959-aa67-99874ab6317e 10.116.2.0/24 for node admiralty-plm-c7ff8eb6-target-cluster-3e3933bad4 after 233.845278ms: instance not found

The node is tainted as follows:

Taints: node.kubernetes.io/network-unavailable:NoSchedule
virtual-kubelet.io/provider=admiralty:NoSchedule

The pods deployed in a namespace with the multicluster.admiralty.io/elect: "" annotation set stay in Pending state.

Is GKE a supported target for multicluster-scheduler deployment?

Thanks

If namespace doesn't exist in one target cluster, proxy scheduler fails

The proxy pod scheduler's filter plugin creates candidate pods in target clusters, waits for those pods to be reserved or unschedulable, and filters the corresponding virtual nodes accordingly. If it is not allowed to create a candidate in a target cluster, it knows to filter out the corresponding virtual node. But if it receives any other error, including when the namespace doesn't exist, the entire scheduling cycle fails, when only that virtual node should be filtered out.
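
A sketch of the intended behavior, with a hypothetical helper: treat any error creating the candidate as grounds to filter out that one virtual node, rather than failing the whole cycle:

package proxy

import (
	"context"

	"k8s.io/klog"
)

// filterVirtualNode returns whether the virtual node passes the filter.
// createCandidate is a hypothetical stand-in for the plugin's logic that
// creates a candidate pod in the target cluster behind the virtual node.
func filterVirtualNode(ctx context.Context, nodeName string, createCandidate func(context.Context) error) bool {
	if err := createCandidate(ctx); err != nil {
		// Forbidden, namespace not found, etc.: this target can't host the
		// pod; filter out its virtual node and keep the cycle going.
		klog.Infof("filtering out virtual node %s: %v", nodeName, err)
		return false
	}
	return true
}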

Schedule to a node rather than a cluster

Theoretically, given all the PodObservations and NodeObservations, multicluster-scheduler should have enough information to decide which node in a target cluster a delegate pod should be bound to. However, for now, it only decides which cluster, and leaves it up to that cluster's default (or custom) single-cluster scheduler to bind the delegate pod to a node.

The current design allows the user to leverage any single-cluster scheduler's advanced scheduling features. However, there is no guarantee that the target cluster will be able to schedule the delegate pod, because the multi-cluster decision didn't account for all of the single-cluster constraints. Also, the time from multi-cluster pod creation to running state could be reduced by combining the two scheduling steps.

Multicluster-scheduler should account for all constraints and directly schedule to a node, to increase reliability and performance. This means that multicluster-controller should use the default Kubernetes scheduler and also kube-batch as libraries.

Note: working with SIG-Scheduling to define common interfaces would help.

Source controller doesn't work if Helm release's name isn't multicluster-scheduler

Source cluster role references in the source controller are hard-coded as multicluster-scheduler-source and multicluster-scheduler-cluster-summary-viewer, which are their names if the Helm release is named multicluster-scheduler, but may be different otherwise.

We need to pass those names from the Helm chart to the controller, e.g., as environment variables.
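
A sketch of the controller side, with hypothetical environment variable names; the chart would template the release-derived names into these variables:

package source

import "os"

// roleName reads a cluster role name from the environment, falling back to
// the historical default for releases named multicluster-scheduler.
func roleName(envVar, fallback string) string {
	if name := os.Getenv(envVar); name != "" {
		return name
	}
	return fallback
}

var (
	sourceRole        = roleName("SOURCE_CLUSTER_ROLE_NAME", "multicluster-scheduler-source")
	summaryViewerRole = roleName("CLUSTER_SUMMARY_VIEWER_ROLE_NAME", "multicluster-scheduler-cluster-summary-viewer")
)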

Pods with names longer than 63 characters can't be scheduled

The feedback controller relies on the multicluster.admiralty.io/parent-name label value to reconcile remote pod chaperons with proxy pods (because their names differ). However, label values are limited to 63 characters, and pod names can be as long as 253 characters. Therefore, candidate pod chaperon creations, which include the needed label, fail for proxy pods whose names are longer than 63 characters.

We need to use annotations instead of labels for remote controller references.

To select controlled objects given a parent object, we can still use the parent UID label, which is short enough.
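
A sketch of the proposed scheme, with hypothetical key names: the parent name (up to 253 characters) moves to an annotation, while the parent UID (36 characters, safe as a label value) stays a label for selection:

package feedback

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	// parentNameAnnotation replaces the multicluster.admiralty.io/parent-name label.
	parentNameAnnotation = "multicluster.admiralty.io/parent-name"
	// parentUIDLabel remains a label, used to select controlled objects.
	parentUIDLabel = "multicluster.admiralty.io/parent-uid"
)

func setParentRefs(child metav1.Object, parent *corev1.Pod) {
	ann := child.GetAnnotations()
	if ann == nil {
		ann = map[string]string{}
	}
	ann[parentNameAnnotation] = parent.Name
	child.SetAnnotations(ann)

	labels := child.GetLabels()
	if labels == nil {
		labels = map[string]string{}
	}
	labels[parentUIDLabel] = string(parent.UID)
	child.SetLabels(labels)
}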

Note: We had fixed this before in multicluster-controller, but the feedback controller doesn't use it.

Publish images for other architectures, especially arm64

cc @jjgraham

Currently we only release amd64 images. We've released arm64 images unofficially (with suffixed tags, e.g., 0.8.0-arm64), but what we should really do is publish a multi-arch manifest, so that the normal version tag, e.g., 0.8.0, would pull an arm64 image on an arm64 node and an amd64 image on an amd64 node.

Agent not working after fresh install on EKS

The agent logs show:
Federation namespace: multifederation
controller.go:160: cannot update *v1alpha1.NodePoolObservation eks-cluster-default in namespace multifederation in scheduler cluster: Operation cannot be fulfilled on nodepoolobservations.multicluster.admiralty.io "eks-cluster-default": the object has been modified; please apply your changes to the latest version and try again
controller.go:161: Could not reconcile Request. Stop working.

Virtual kubelet nodes with huge capacity interfere with kube-dns autoscaler

Some Kubernetes distributions enable DNS horizontal autoscaling by default, e.g., a kube-dns-autoscaler deployment running cluster-proportional-autoscaler autoscales the kube-dns deployment in GKE. The scale depends on the number of nodes and the number of CPU cores (a rough indication of the load on system components, assuming those nodes/cores aren't idle).

Here is a common configuration: linear: '{"coresPerReplica":256,"min":1,"nodesPerReplica":16}'.

Our virtual-kubelet nodes have a dummy, very high CPU capacity, 100,000 cores, because the candidate schedulers in the corresponding target clusters actually enforce capacity constraints, based on nodes there. We only need the virtual-kubelet nodes' capacities to be equal to or greater than the actual capacities of the clusters that they represent. A quick, good-enough solution was to set those numbers very high.

However, with DNS horizontal autoscaling, after multicluster-scheduler is installed, hundreds of kube-dns replicas are added, overloading kube-scheduler and starving the source cluster. This issue can be fixed at two levels:

  • A simple documentation fix is to tell users either to disable DNS horizontal autoscaling (not ideal) or, better, to configure it to ignore virtual-kubelet nodes with the --nodelabels option. AKS does that for ACI nodes, and EKS for Fargate nodes.
  • To lower the risk of breaking a cluster while installing multicluster-scheduler, we should implement a control loop to update a virtual-kubelet node's capacity as the sum of the capacities of the corresponding target cluster's nodes (a sketch follows this list).
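
A sketch of the second item's control-loop body: sum the target cluster's node capacities and apply the result to the virtual node's status:

package capacity

import corev1 "k8s.io/api/core/v1"

// sumCapacity aggregates the capacity of all nodes in a target cluster; the
// result would be applied to the corresponding virtual node's status instead
// of the current 100,000-core placeholder.
func sumCapacity(nodes []corev1.Node) corev1.ResourceList {
	total := corev1.ResourceList{}
	for _, n := range nodes {
		for name, qty := range n.Status.Capacity {
			sum := total[name]
			sum.Add(qty)
			total[name] = sum
		}
	}
	return total
}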

cc @verchol

Finalizers added to all secrets and config maps

Multicluster-scheduler should only add its cross-cluster garbage collection finalizers to secrets and config maps that are copied to other clusters.

At least, it should only add finalizers to those resources in labeled namespaces.

@dimm0

If delegate Pod is deleted and scheduler cluster is unavailable, delegate Pod is not replaced

If a delegate Pod is deleted, e.g., evicted, multicluster-scheduler's agent creates a new one, based on the owning PodDecision in the scheduler cluster. However, if the scheduler cluster is unavailable (common in poor-connectivity use cases), nothing will replace the deleted Pod until connectivity is reestablished.

Therefore, I propose a new custom resource, Delegates: an intermediary between PodDecisions and delegate Pods, colocated with delegate Pods and owning/controlling them.

delegate pods always created at scheduler cluster

I'm having an issue with scheduling pods on any cluster. The delegate pods always get created in the scheduler cluster, even with the annotation set to the other cluster.

Setup: 2 clusters created on top of bare metal with Cilium running (no cluster mesh yet). c1 is the scheduler. I deploy pods on c2 and get delegate pods on c1. Great! But it still delegates to c1 with the annotation explicitly set to c2. If I try the other way around and deploy pods on c1, it also delegates to c1, even with the annotation set.

I want to fix this and at least be able to get delegate pods on c2 when I deploy on c1. Any suggestions? Thanks!

kubernetes version: 1.6.0
cilium version: 1.6.3

deploy pods on c2
expected: yes

$ kubectl --context c2 get pods
NAME                      READY   STATUS    RESTARTS   AGE
nginx-69c855d8f5-hp54d    1/1     Running   0          10m
nginx-69c855d8f5-lb578    1/1     Running   0          10m
nginx-69c855d8f5-zb5d6    1/1     Running   0          10m
ubuntu-56978cc9cb-7ql2q   1/1     Running   0          19s

$ kubectl --context c1 get pods
NAME                                 READY   STATUS    RESTARTS   AGE
c2-default-nginx-69c855d8f5-hp54d    1/1     Running   0          10m
c2-default-nginx-69c855d8f5-lb578    1/1     Running   0          10m
c2-default-nginx-69c855d8f5-zb5d6    1/1     Running   0          10m
c2-default-ubuntu-56978cc9cb-7ql2q   1/1     Running   0          71s

deploy pods on c1
expected: no, the delegate pods c1-default-nginx... should be created on c2

$ kubectl  --context c1 get pods --watch
NAME                                 READY   STATUS    RESTARTS   AGE
c1-default-nginx-956d89fb-5w4bb      1/1     Running   0          17s
c1-default-nginx-956d89fb-gsvv6      1/1     Running   0          17s
c1-default-nginx-956d89fb-wx7fh      1/1     Running   0          17s
c2-default-ubuntu-56978cc9cb-7ql2q   1/1     Running   1          46h
nginx-956d89fb-5w4bb                 1/1     Running   0          18s
nginx-956d89fb-gsvv6                 1/1     Running   0          18s
nginx-956d89fb-wx7fh                 1/1     Running   0          17s

Logs are flooded with "successfully synced" messages about no-ops

It appears that all config maps and secrets in all namespaces are logged as "Successfully synced ..." even though nothing was done for most of them. (Only the config maps and secrets referred to by proxy pods are copied to target clusters.)

The message in question is an info log from the processNextWorkItem function of the controller package. If Handle is successful and doesn't require requeuing, the "Successfully synced" message is printed. The following controllers, in particular, enqueue all config maps and secrets and filter them in their respective Handle functions, called from processNextWorkItem. We could filter config maps and secrets in event handlers, so that processNextWorkItem is only called for relevant config maps and secrets. We could also just remove the confusing info log (it's not super useful).
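
A sketch of the first option, assuming informer-based controllers: wrap the event handler in client-go's FilteringResourceEventHandler so irrelevant config maps are never enqueued, and processNextWorkItem (with its log line) never runs for them:

package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// addFilteredHandler registers handler for config map events, but only for
// config maps referenced by proxy pods; isReferencedByProxyPod is a
// hypothetical predicate assumed to exist elsewhere in the controller.
func addFilteredHandler(informer cache.SharedIndexInformer, handler cache.ResourceEventHandler, isReferencedByProxyPod func(*corev1.ConfigMap) bool) {
	informer.AddEventHandler(cache.FilteringResourceEventHandler{
		FilterFunc: func(obj interface{}) bool {
			cm, ok := obj.(*corev1.ConfigMap)
			return ok && isReferencedByProxyPod(cm)
		},
		Handler: handler,
	})
}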

agent cannot be restarted

Reported by @tghartland, who noted that the agent pod cannot be force deleted either.

The agent pod has a pod observation itself. Therefore, it has a finalizer (for cross-cluster garbage collection). The deployment's strategy is "recreate", meaning there can only be one agent running. When the agent pod is deleted (before a new one can be started, according to the strategy), the finalizer isn't removed, because that would have been the agent's own responsibility... The "recreate" strategy may not be necessary, though, so a new agent could remove its predecessor's finalizer. If there cannot be two concurrent agents (especially for virtual-kubelet), we need leader election, except for the finalizer-cleaning part... A hacky solution could also be to not observe the agent. In any case, observations and decisions may not exist in the next version.

Proxy pod appears running even if delegate pod isn't running yet

The proxy pod is rapidly scheduled by its local cluster's scheduler (the default one or the one defined in the spec). This is misleading to the user, because the delegate pod may take longer to start in its target cluster (or stay pending if it's unschedulable).

The pod admission webhook should replace the scheduler name in the proxy pod's spec, so the agent can act as its single-cluster scheduler and only bind it to a node when the corresponding delegate pod is bound to a node in the target cluster.
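
A sketch of the mutation, with hypothetical names for the proxy scheduler and for the annotation that preserves the user's choice:

package webhook

import corev1 "k8s.io/api/core/v1"

const (
	proxySchedulerName       = "admiralty-proxy"                                 // hypothetical
	sourceSchedulerNameAnnot = "multicluster.admiralty.io/source-scheduler-name" // hypothetical
)

// mutateProxyPod makes the agent the proxy pod's scheduler, saving the
// original scheduler name for the delegate pod in the target cluster.
func mutateProxyPod(pod *corev1.Pod) {
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations[sourceSchedulerNameAnnot] = pod.Spec.SchedulerName
	pod.Spec.SchedulerName = proxySchedulerName
}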

Controller manager and proxy scheduler crash if one Target is misconfigured.

If one target is misconfigured, e.g., if the referenced kubeconfig secret doesn't exist, the controller manager and proxy scheduler crash (and restart, until configuration is fully ready). This was by design, following the "fail fast, not silently" pattern. It turns out that users would rather have a partially working system—understandably, especially when a misconfigured target is added to a set of properly configured targets.

So we should skip a target if it is misconfigured, but load it as soon as it is properly configured. This means we need to watch for kubeconfig secret changes in the restarter.
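
A sketch of the proposed loading loop: log and skip targets whose kubeconfig can't be loaded, instead of crashing:

package targets

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// loadTargetClients builds a client per target, skipping (not crashing on)
// misconfigured ones; a watch on kubeconfig secrets would retry them later.
func loadTargetClients(kubeconfigs map[string][]byte) map[string]kubernetes.Interface {
	clients := map[string]kubernetes.Interface{}
	for name, kubeconfig := range kubeconfigs {
		cfg, err := clientcmd.RESTConfigFromKubeConfig(kubeconfig)
		if err != nil {
			log.Printf("skipping misconfigured target %s: %v", name, err)
			continue
		}
		client, err := kubernetes.NewForConfig(cfg)
		if err != nil {
			log.Printf("skipping misconfigured target %s: %v", name, err)
			continue
		}
		clients[name] = client
	}
	return clients
}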

Jobs run indefinitely with `multicluster.admiralty.io/elect: ""` annotation

Jobs scheduled with the multicluster.admiralty.io/elect: "" annotation never reach completion.

As an example, I set up the multicluster scheduler on two OpenStack nodes following the instructions in the README, then scheduled the attached sample job, taken from the Kubernetes docs, with the multicluster.admiralty.io/elect: "" annotation. The job should finish in ~10s, but it is still running after 12 hours:

[root@node1 ~]# kubectl get pods | grep pi
pi-6hkw9                                       1/1     Running   0          12h

Cilium pods stuck in Pending status

I followed the README.md to set up multicluster-scheduler with two kind clusters (Cilium 1.8.1 as their CNI, no cluster mesh). Virtual nodes have been created in cluster1, but Cilium pods stay in Pending status... How can I solve this problem?

PS: everything is OK when using kind's default CNI ("kindnetd").

# kubectl --context "$CLUSTER1" get node
NAME                     STATUS   ROLES     AGE   VERSION
admiralty-c1             Ready    cluster   21m
admiralty-c2             Ready    cluster   21m
cluster1-control-plane   Ready    master    85m   v1.18.2
cluster1-worker          Ready    <none>    84m   v1.18.2

# kubectl --context "$CLUSTER2" get node
NAME                     STATUS   ROLES    AGE   VERSION
cluster2-control-plane   Ready    master   83m   v1.18.2
cluster2-worker          Ready    <none>   83m   v1.18.2

# kubectl --context "$CLUSTER1" get node -l virtual-kubelet.io/provider=admiralty
NAME           STATUS   ROLES     AGE   VERSION
admiralty-c1   Ready    cluster   20m
admiralty-c2   Ready    cluster   20m

Pending pods:

# kubectl --context "$CLUSTER1" get pods --all-namespaces -o wide | grep -i pending
kube-system          cilium-6mrvl                                                  0/1     Pending   0          22m   <none>       admiralty-c1             <none>           <none>
kube-system          cilium-node-init-7jsvq                                        0/1     Pending   0          22m   <none>       admiralty-c2             <none>           <none>
kube-system          cilium-node-init-s89xs                                        0/1     Pending   0          22m   <none>       admiralty-c1             <none>           <none>
kube-system          cilium-vsnnh                                                  0/1     Pending   0          22m   <none>       admiralty-c2             <none>           <none>
# kubectl --context "$CLUSTER1" describe po cilium-6mrvl -n kube-system
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  32m   default-scheduler  0/4 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 Too many pods, 3 node(s) didn't match node selector.
  Normal   Scheduled         32m   default-scheduler  Successfully assigned kube-system/cilium-6mrvl to admiralty-c1

Ingresses should follow pods

More specifically, they should follow services that follow pods.

One use case is for integration with Admiralty Cloud's DNS-based global server load balancing (GSLB) solution.

Avoid name collisions and truncate names

The way names for controlled objects are formatted can result in collisions and length errors.

For example, virtual nodes for namespaced targets are named using this format: fmt.Sprintf("admiralty-namespace-%s-%s", target.Namespace, target.Name), which would result in the same name for target baz in the foo-bar namespace and target bar-baz in the foo namespace. A common way to fix this is to append a hash of fmt.Sprintf("%s/%s", target.Namespace, target.Name) at the end, which cannot collide because slashes aren't allowed in namespaces and names.

Furthermore, if the name of the target is very long, the resulting virtual node name could exceed the limit of 253 characters. A common solution is to truncate the result and keep some room for a hash (to avoid prefix collisions).

Source role bindings are also affected.
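
A sketch of both fixes combined: append a short hash of the namespace/name pair and truncate the readable prefix so the result stays within the limit:

package names

import (
	"crypto/sha256"
	"fmt"
)

const maxNameLength = 253

// virtualNodeName can't collide across targets, because slashes aren't
// allowed in namespaces and names, so distinct pairs hash distinct inputs.
func virtualNodeName(targetNamespace, targetName string) string {
	hash := fmt.Sprintf("%x", sha256.Sum256([]byte(targetNamespace+"/"+targetName)))[:10]
	prefix := fmt.Sprintf("admiralty-namespace-%s-%s", targetNamespace, targetName)
	if max := maxNameLength - len(hash) - 1; len(prefix) > max {
		prefix = prefix[:max]
	}
	return prefix + "-" + hash
}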

Provide guidance regarding resource requests and limits

In a cluster with 120 nodes and 2600 pods, the agent pod is OOM killed before it can send any observation. Increasing the default memory limit fixes the issue, but we need to provide guidance on how much memory is needed as a function of node and pod cardinality.

The Proxy Pod Logs an Error

Summary

I built clusters and configured them as the documentation says. The proxy pods work as expected, syncing status, but their logs do not. I am not sure if this is in line with expectations.

How to Reproduce?

My environment variables are shown below:

export CLUSTER1=cluster1-admin@cluster1
export CLUSTER2=cluster2-admin@cluster2
export NAMESPACE=default

I applied a deployment specifying that the pod should run on my $CLUSTER2:

# remote-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: say
spec:
  replicas: 1
  selector:
    matchLabels:
      app: say
  template:
    metadata:
      labels:
        app: say
      annotations:
        multicluster.admiralty.io/elect: ""
        multicluster.admiralty.io/clustername: cluster2-admin@cluster2
    spec:
      containers:
      - name: say
        image: docker/whalesay:latest
        command: [cowsay]
        args: ["hello world"]

After it finished running, I printed the logs of the pod in $CLUSTER2. They showed what I expected.

$ kubectl --context $CLUSTER2 logs say-6895d46c5c-pllx4-84wrs
 _____________
< hello world >
 -------------
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

But the logs of the pod in $CLUSTER1 are different:

$ kubectl --context $CLUSTER1 logs say-6895d46c5c-pllx4
Error from server: no preferred addresses found; known addresses: []

node may not have pre-allocated hugepages for multiple page sizes

Given a source-target cluster connection, if the target cluster supports multiple huge page sizes (feature gate HugePageStorageMediumSize enabled) but the source cluster doesn't—in particular, if the target cluster runs Kubernetes 1.19 (feature gate enabled by default) and the source cluster runs 1.18 or lower—the cluster summary in the target cluster will include aggregated capacity and allocatable resources for multiple huge page sizes. When the upstream resource controller tries to apply those values to the virtual node corresponding to the target cluster, kube-apiserver rejects the status update. The virtual node remains with zero capacity and allocatable resources; therefore, pods can't be scheduled to the target cluster, because the virtual node doesn't pass the standard resource filter in the proxy scheduler.
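
One possible mitigation, sketched under the assumption that we can detect whether the source cluster accepts multiple huge page sizes: drop the huge page resources that would make kube-apiserver reject the whole status update:

package upstream

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// sanitize drops huge page resources from an aggregated resource list when
// the source cluster would reject a status update containing several sizes.
func sanitize(in corev1.ResourceList, sourceSupportsMultipleSizes bool) corev1.ResourceList {
	hugePageSizes := 0
	for name := range in {
		if strings.HasPrefix(string(name), corev1.ResourceHugePagesPrefix) {
			hugePageSizes++
		}
	}
	out := corev1.ResourceList{}
	for name, qty := range in {
		if !sourceSupportsMultipleSizes && hugePageSizes > 1 &&
			strings.HasPrefix(string(name), corev1.ResourceHugePagesPrefix) {
			continue // would fail validation on the source kube-apiserver
		}
		out[name] = qty
	}
	return out
}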

Pods with init containers cannot run in remote clusters

reported by @dimm0

The pod chaperon controller in the target cluster fails with error syncing 'NAMESPACE/POD_NAME': Pod "POD_NAME" is invalid: spec.initContainers[0].volumeMounts[1].name: Not found: "default-token-gnnn6", requeuing. In turn, the candidate scheduler times out, and the proxy scheduler times out.

Service account token volume mounts are removed from pod specs when pod chaperons are created, to avoid this issue: service accounts and their tokens are different in different clusters, and delegate pods use service accounts found in their target clusters.

The bug is that volume mounts aren't removed for init containers.
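
A sketch of the fix: apply the same mount stripping to init containers as to regular containers:

package chaperon

import corev1 "k8s.io/api/core/v1"

// stripTokenMounts removes the service account token volume mount from a
// slice of containers, in place.
func stripTokenMounts(containers []corev1.Container, tokenVolumeName string) {
	for i := range containers {
		mounts := containers[i].VolumeMounts[:0]
		for _, m := range containers[i].VolumeMounts {
			if m.Name != tokenVolumeName {
				mounts = append(mounts, m)
			}
		}
		containers[i].VolumeMounts = mounts
	}
}

func sanitizePodSpec(spec *corev1.PodSpec, tokenVolumeName string) {
	stripTokenMounts(spec.Containers, tokenVolumeName)
	stripTokenMounts(spec.InitContainers, tokenVolumeName) // the missing piece
}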

<=0.2.0 CRD manifest incompatible with Kubernetes >=1.11

(⎈ |nautilus:myazdani)➜  ~ kubectl apply -f  https://github.com/admiraltyio/multicluster-scheduler/releases/download/v0.2.0/scheduler.yaml

namespace/multicluster-scheduler created

error: error validating "https://github.com/admiraltyio/multicluster-scheduler/releases/download/v0.2.0/scheduler.yaml": error validating data: [ValidationError(CustomResourceDefinition.status): missing required field "conditions" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1beta1.CustomResourceDefinitionStatus, ValidationError(CustomResourceDefinition.status): missing required field "storedVersions" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1beta1.CustomResourceDefinitionStatus]; if you choose to ignore these errors, turn validation off with --validate=false

k8s v1.13.3

Service rerouting doesn't reroute updated service

Service rerouting transforms a selector like:

selector:
  a: b

into

selector:
  multicluster.admiralty.io/a: b

to match delegate pods instead of proxy pods.

However, when the service is updated, or the original configuration is re-applied by the user, the selector becomes:

selector:
  multicluster.admiralty.io/a: b
  a: b

Service rerouting should modify the selector again to make it look like the previous snippet. The controller's shouldReroute function looks at the service's Endpoints object to see if it targets proxy pods, but at that time, it doesn't match anything.

Proposed solution: don't use the Endpoints object; instead, list pods directly with a filtered selector (keys not prefixed with our domain), or, when the service is already rerouted (i.e., the selector only contains keys prefixed with our domain), use the original selector saved as an annotation.
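
A sketch of the second option: rewrite the selector purely from its keys, which handles both the fresh service and the re-applied one (where prefixed and unprefixed keys coexist):

package reroute

import "strings"

const prefix = "multicluster.admiralty.io/"

// reroutedSelector prefixes unprefixed keys and folds duplicates, so
// re-applying the original configuration converges to the rerouted form.
func reroutedSelector(selector map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range selector {
		if strings.HasPrefix(k, prefix) {
			out[k] = v
		} else {
			out[prefix+k] = v
		}
	}
	return out
}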

fail to schedule pending pods

I'm trying to write my own scheduler with the Kubernetes Python client to schedule pods. These pods are already submitted to the cluster and select my own scheduler, so they are pending in the cluster. I patch the annotations

{"multicluster.admiraly.io/clustername":str_cluster,'multicluster.admiralty.io/elect':''}

onto the pods and expect them to be scheduled to <str_cluster>. But they fail to schedule.

Then I check the pod YAML (not complete):
[screenshot of the pod YAML]

The result of kubectl describe pod nginx-4 is:
[screenshot of the describe output]
But the pod status is still pending, and the virtual pod doesn't appear.

Is multicluster-scheduler able to schedule pending pods? What should I do?
Thanks.

Does Admiralty support OpenID Connect (OIDC)?

I'm trying to deploy Admiralty to IBM Kubernetes Service (IKS) on IBM Cloud and to an on-premise datacenter (x86).

When I attempt to register the cluster with Admiralty, it returns the following error message:

[screenshot of the error message]

IBM Cloud uses OpenID Connect; does Admiralty support this?

Upgrade Kubernetes libraries to 1.18

This is a prerequisite for integrating with other projects in the Kubernetes ecosystem that have already upgraded. With client-go's major breaking change, we all need to get on the same page.

Here's what the upgrade involves:

  • client-go added context to many of its method signatures. We need to pipe any available context to client-go and create contexts when none is available (as a temporary measure; ideally, we would want all contexts to inherit from a parent in main).
  • virtual-kubelet hasn't upgraded to 1.18 yet. Here's a PR for that. In the meantime, we can use the fork.
  • The scheduler framework's PostFilter plugin has not only been renamed to PreScore; it also no longer runs if len(filteredNodes) == 0, which is precisely when we needed it, for the candidate scheduler to inform the proxy scheduler, via annotations, that a candidate is unschedulable. We can rely on pod status conditions instead (watch those instead of our annotation in the proxy scheduler's Filter plugin).
  • We can actually rely on pod conditions to inform the proxy scheduler that a candidate is bound or that it failed. The only annotations that we really need are reserved/allowed.
  • We use the Permit plugin in the candidate scheduler to wait until the proxy scheduler elects a candidate, and in the proxy scheduler to wait until the elected candidate is bound or not. In 1.18, the Permit plugin runs in the scheduling cycle, and the context is cancelled by the time we poll (in a goroutine), so we can't use the context to stop polling. As suggested by @Huang-Wei, it's actually simpler to poll in the PreBind plugin (see the sketch after this list); no goroutine is needed then. Note: we should watch and reconcile rather than poll, but that's out of scope.
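
A sketch of the last item, with a hypothetical condition helper: PreBind runs in the binding cycle's goroutine, so it can poll synchronously:

package plugin

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForCandidate polls until candidateBound (a hypothetical condition
// checking the elected candidate's pod conditions) reports done, or errors.
// Called from PreBind; no extra goroutine is needed.
func waitForCandidate(candidateBound wait.ConditionFunc) error {
	return wait.PollImmediate(time.Second, 5*time.Minute, candidateBound)
}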

proxy pods stuck in terminating

Reported by @tghartland, who noted that force deleting works.

Looks like the last status of delegate pods isn't fed back to proxy pods, so even though proxy pods have a deletion timestamp and no more finalizer, their status shows that all containers are still running, so they can't be deleted gracefully.

This is a regression: pre-v0.6.0, before virtual-kubelet's adoption, proxy pods were deleted by sending a signal to their dummy containers.

Set up CI

  • run unit tests
  • verify code gen
  • go vet
  • build Docker images
  • run e2e tests

Scripts already exist. Just automate them on pushes and PRs to master with, e.g., GitHub Actions.

E2E test is flaky in CI

Even though it famously "works on my machine"™!

PR run succeeds: https://github.com/admiraltyio/multicluster-scheduler/actions/runs/146754445
Merge commit fails: https://github.com/admiraltyio/multicluster-scheduler/actions/runs/147689563
No-op commit succeeds: https://github.com/admiraltyio/multicluster-scheduler/actions/runs/147716838

When it fails, it happens immediately after the Argo workflow starts. I suspect a webhook isn't ready?

I added a kubectl cluster-info dump, saved as an artifact, to better understand what's going on. We should look into it next time the test fails.

Helm setup requires two passes

I'm finding that the Helm setup used for the latest release (v0.6.0) takes two passes of helm install multicluster-scheduler [...] to work properly. On the first pass of, e.g.,

helm install multicluster-scheduler admiralty/multicluster-scheduler \
  --kube-context $CLUSTER2 \
  --set agent.enabled=true \
  --set agent.clusterName=c2

the following error message is received:

Error: Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: dial tcp 10.233.54.131:443: connect: connection refused

and the multicluster-scheduler-agent-[...] pod is stuck in the ContainerCreating state with the following error, even after exporting the remote secret via ./kubemcsa export --context $CLUSTER1 c1 --as remote | kubectl --context $CLUSTER1 apply -f - and ./kubemcsa export --context $CLUSTER1 c2 --as remote | kubectl --context $CLUSTER2 apply -f -:

Events:
  Type     Reason       Age                   From               Message
  ----     ------       ----                  ----               -------
  Normal   Scheduled    3m7s                  default-scheduler  Successfully assigned default/multicluster-scheduler-agent-9fddf8bf-85ddk to node3
  Warning  FailedMount  2m59s (x5 over 3m7s)  kubelet, node3     MountVolume.SetUp failed for volume "remote" : secret "remote" not found
  Warning  FailedMount  64s                   kubelet, node3     Unable to mount volumes for pod "multicluster-scheduler-agent-9fddf8bf-85ddk_default(4160c035-35ec-11ea-a853-fa163ec400e7)": timeout expired waiting for volumes to attach or mount for pod "default"/"multicluster-scheduler-agent-9fddf8bf-85ddk". list of unmounted volumes=[cert]. list of unattached volumes=[cert config remote multicluster-scheduler-agent-token-6hpw8]
  Warning  FailedMount  59s (x9 over 3m7s)    kubelet, node3     MountVolume.SetUp failed for volume "cert" : secret "multicluster-scheduler-agent-cert" not found

Things seem to work fine after this point, but only after multicluster-scheduler is uninstalled and subsequently re-installed with Helm:

helm uninstall multicluster-scheduler --kube-context cluster2
helm install multicluster-scheduler admiralty/multicluster-scheduler \
  --kube-context $CLUSTER2 \
  --set agent.enabled=true \
  --set agent.clusterName=c2

Route controller taints virtual kubelet nodes

In some Kubernetes distributions, e.g., GKE routes-based clusters, the Kubernetes route controller tries to connect to our virtual-kubelet nodes to create network routes, but can't. It then updates the status of those nodes with a "network unavailable" condition. With the TaintNodesByCondition feature enabled (GA in 1.17, gate removed in 1.18), the condition is transformed into a taint that prevents pods that don't tolerate it from being scheduled to those nodes.

This affects our proxy pods. Unfortunately, at the moment, the user cannot add local tolerations to proxy pods. Any toleration added to a source pod is saved to be applied to the corresponding candidate/delegate pods instead, and overwritten at admission by hard-coded tolerations (to tolerate the virtual-kubelet.io/provider=admiralty taint).

Workaround: note that this issue doesn't affect VPC-native clusters in GKE.

  • An immediate solution is to add a hard-coded toleration.
  • A long term solution to similar issues is to allow the user to define local scheduling constraints (including custom tolerations) for proxy pods. (TODO: create issue for this item, detailing other benefits.)

cc @verchol

candidate-scheduler and proxy-scheduler: Forbidden: cannot list resource "csinodes"

I'm not able to schedule pods across clusters using v0.8.0.

After deploying v0.8.0 using helm to 3 clusters, I'm seeing the following error in proxy-scheduler and candidate-scheduler:

E0507 13:29:58.249696       1 reflector.go:156] pkg/mod/k8s.io/client-go@.../tools/cache/reflector.go:108: Failed to list *v1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:serviceaccount:admiralty:multicluster-scheduler" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope

Following online guidance, I tried adding the following lines to the multicluster-scheduler ClusterRole:

- apiGroups:
  - storage.k8s.io
  resources:
  - csinodes
  verbs:
  - watch
  - list
  - get

After applying it to the cluster, I now see the following error message (again in the proxy-scheduler and candidate-scheduler):

E0507 13:25:51.759485       1 reflector.go:156] pkg/mod/k8s.io/client-go@.../tools/cache/reflector.go:108: Failed to list *v1.CSINode: the server could not find the requested resource

Environment

Kubernetes Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-af3caf", GitCommit:"af3caf6136cd355f467083651cc1010a499f59b1", GitTreeState:"clean", BuildDate:"2020-03-27T21:51:36Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
Cloud: AWS (EKS)
Helm chart/version: v0.8.0

Unable to delete resource objects after uninstallation

Multicluster-scheduler adds finalizers to resources that it doesn't own, for proper garbage collection of remote observation objects. They must be removed when multicluster-scheduler is uninstalled; otherwise, those objects cannot be deleted.

This should be implemented as a job run as a Helm post-delete hook.

In the meantime, you can run the following to properly clean up your system:

echo 'metadata:
  $deleteFromPrimitiveList/finalizers:
    - multicluster.admiralty.io/multiclusterForegroundDeletion' > patch.yaml

for kind in service pod persistentvolumeclaim replicationcontroller replicaset statefulset poddisruptionbudget
do
  kubectl get $kind --all-namespaces -o go-template='{{range .items}}{{ printf "kubectl patch %s %s -n %s -p \"$(cat patch.yaml)\"\n" .kind .metadata.name .metadata.namespace }}{{ end }}' | sh
done

for kind in node persistentvolume storageclass
do
  kubectl get $kind -o go-template='{{range .items}}{{ printf "kubectl patch %s %s -p \"$(cat patch.yaml)\"\n" .kind .metadata.name }}{{ end }}' | sh
done

Run all controllers in a source cluster, nothing in target clusters

Hi guys,

TL;DR

Three questions:

  1. Is it really okay not to install cert-manager and multicluster-scheduler on the target cluster, and to install both only on the source cluster?
  2. Apart from kubectl get pods -n admiralty and kubectl get nodes -o wide, is there any way to ensure all the clusters are set up and connected correctly?
  3. The nginx demo didn't work: pods were always pending, and I got no virtual pods or delegate pods. How can I debug or resolve this issue?

Here's the lengthy context 😢 :

I'm a newcomer to multicluster-scheduler, and I want to deploy Argo Workflows across clusters while avoiding changing the target clusters too much.

So I started with the multicluster-scheduler installation guide first and followed the instructions twice:

  1. used Helm to install cert-manager v0.12.0 and multicluster-scheduler 0.10.0-rc.1 only on the source cluster (as per the doc, the "If you can only access one of the two clusters" section says it's okay not to install on the target cluster);
  2. installed cert-manager v0.12.0 and multicluster-scheduler 0.10.0-rc.1 on both the source and target clusters.

Both attempts looked great after the installation part:

  • kubectl get nodes -o wide returns the two virtual cluster nodes;
  • kubectl get pods -n admiralty shows all 3 pods on the source cluster are running;

Here are the details after installing cert-manager and multicluster-scheduler on both clusters:

❯ export CLUSTER1=dev-earth
❯ export CLUSTER2=dev-moon

❯ kubectl --context "$CLUSTER1" get pods -n admiralty
NAME                                                          READY   STATUS    RESTARTS   AGE
multicluster-scheduler-candidate-scheduler-79f956995c-9kvbz   1/1     Running   0          125m
multicluster-scheduler-controller-manager-6b694fc496-vvhlw    1/1     Running   0          125m
multicluster-scheduler-proxy-scheduler-59bb6c778-45x54        1/1     Running   0          125m

❯ kubectl --context "$CLUSTER1" get nodes -o wide
NAME                               STATUS   ROLES     AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
admiralty-earth                    Ready    cluster   3h5m             <none>          <none>        <unknown>            <unknown>           <unknown>
admiralty-moon                     Ready    cluster   125m             <none>          <none>        <unknown>            <unknown>           <unknown>
host-10-198-21-72-10.198.21.72     Ready    <none>    24d    v1.14.1   10.198.21.72    <none>        Ubuntu 16.04.1 LTS   4.4.0-130-generic   docker://18.9.9
host-10-198-22-176-10.198.22.176   Ready    <none>    24d    v1.14.1   10.198.22.176   <none>        Ubuntu 16.04.1 LTS   4.4.0-130-generic   docker://18.9.9
host-10-198-23-129-10.198.23.129   Ready    master    24d    v1.14.1   10.198.23.129   <none>        Ubuntu 16.04.1 LTS   4.4.0-130-generic   docker://18.9.9

❯ kubectl --context "$CLUSTER2" get pods -n admiralty
NAME                                                          READY   STATUS    RESTARTS   AGE
multicluster-scheduler-candidate-scheduler-6cff566db6-77l9h   1/1     Running   0          59m
multicluster-scheduler-controller-manager-d857466cd-bvxht     1/1     Running   1          59m
multicluster-scheduler-proxy-scheduler-7d8bb6666d-568tc       1/1     Running   0          59m

❯ kubectl --context "$CLUSTER2" get nodes -o wide
NAME                 STATUS     ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
bj-idc-10-10-18-73   NotReady   <none>   15d   v1.15.9   10.10.18.73   <none>        Ubuntu 16.04.6 LTS   4.15.0-45-generic   docker://18.9.7
bj-idc-10-10-18-74   Ready      master   15d   v1.15.9   10.10.18.74   <none>        Ubuntu 16.04.1 LTS   4.4.0-184-generic   docker://19.3.8

But the nginx demo never works; I always get 10 pending pods on the source cluster, no virtual pods, and no pods at all on the target cluster.

Here's the kubectl get pods output (I ran the demo three times, and Pending pods turned into everlasting Terminating pods after the kubectl delete pods nginx-6b676d6776-XXX --grace-period=0 --force forced deletion):

❯ kb get pods
NAME                                      READY   STATUS        RESTARTS   AGE
camera-pose-service-6c79665bbb-lvcsv      1/1     Running       0          3d22h
counter                                   1/1     Running       0          20h
importing-agent-754b64c55b-xhnsx          1/1     Running       0          17d
myapp                                     1/1     Running       0          3d22h
nginx-6b676d6776-25ztl                    0/1     Terminating   0          96m
nginx-6b676d6776-2ph7q                    0/1     Terminating   0          96m
nginx-6b676d6776-6b2qb                    0/1     Terminating   0          28m
nginx-6b676d6776-7bssr                    0/1     Pending       0          14m
nginx-6b676d6776-7jsgt                    0/1     Terminating   0          28m
nginx-6b676d6776-7z5c5                    0/1     Pending       0          14m
nginx-6b676d6776-9zrnm                    0/1     Pending       0          14m
nginx-6b676d6776-bw4ck                    0/1     Pending       0          14m
nginx-6b676d6776-cvdkx                    0/1     Terminating   0          28m
nginx-6b676d6776-ghkwk                    0/1     Terminating   0          28m
nginx-6b676d6776-hv9nz                    0/1     Pending       0          14m
nginx-6b676d6776-kk5ng                    0/1     Terminating   0          96m
nginx-6b676d6776-kwtx5                    0/1     Terminating   0          28m
nginx-6b676d6776-lbqjv                    0/1     Terminating   0          96m
nginx-6b676d6776-mc2nr                    0/1     Terminating   0          28m
nginx-6b676d6776-mnmlr                    0/1     Terminating   0          96m
nginx-6b676d6776-n7p47                    0/1     Pending       0          14m
nginx-6b676d6776-p4pkj                    0/1     Pending       0          14m
nginx-6b676d6776-pql85                    0/1     Terminating   0          96m
nginx-6b676d6776-q89px                    0/1     Terminating   0          96m
nginx-6b676d6776-r7frt                    0/1     Terminating   0          28m
nginx-6b676d6776-rfv9d                    0/1     Terminating   0          96m
nginx-6b676d6776-rwpn5                    0/1     Terminating   0          28m
nginx-6b676d6776-trt64                    0/1     Pending       0          14m
nginx-6b676d6776-txdvs                    0/1     Terminating   0          96m
nginx-6b676d6776-tzb9l                    0/1     Terminating   0          28m
nginx-6b676d6776-v9swj                    0/1     Pending       0          14m
nginx-6b676d6776-xbcx4                    0/1     Terminating   0          96m
nginx-6b676d6776-z9qlr                    0/1     Terminating   0          28m
nginx-6b676d6776-zzjmx                    0/1     Pending       0          14m
pose-wrapper-6446446694-hbvtj             1/1     Running       0          3d22h
sfd-init-minio-bucket-job-svfxq           0/1     Completed     0          23d
simple-feature-db-proxy-6c795967d-d5xkx   1/1     Running       0          3d6h
simple-feature-db-set-0-worker-0          1/1     Running       0          3d6h

kubectl describe pod for a pending pod:

❯ kubectl describe pod nginx-6b676d6776-7bssr
Name:           nginx-6b676d6776-7bssr
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=nginx
                multicluster.admiralty.io/has-finalizer=true
                pod-template-hash=6b676d6776
Annotations:    multicluster.admiralty.io/elect:
                multicluster.admiralty.io/sourcepod-manifest:
                  apiVersion: v1
                  kind: Pod
                  metadata:
                    annotations:
                      multicluster.admiralty.io/elect: ""
                    creationTimestamp: null
                    generateName: nginx-6b676d6776-
                    labels:
                      app: nginx
                      pod-template-hash: 6b676d6776
                    namespace: default
                    ownerReferences:
                    - apiVersion: apps/v1
                      blockOwnerDeletion: true
                      controller: true
                      kind: ReplicaSet
                      name: nginx-6b676d6776
                      uid: 83c74b66-d885-11ea-bff2-fa163e6c431e
                  spec:
                    containers:
                    - image: nginx
                      imagePullPolicy: Always
                      name: nginx
                      ports:
                      - containerPort: 80
                        protocol: TCP
                      resources:
                        requests:
                          cpu: 100m
                          memory: 32Mi
                      terminationMessagePath: /dev/termination-log
                      terminationMessagePolicy: File
                      volumeMounts:
                      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
                        name: default-token-m962m
                        readOnly: true
                    dnsPolicy: ClusterFirst
                    enableServiceLinks: true
                    priority: 0
                    restartPolicy: Always
                    schedulerName: default-scheduler
                    securityContext: {}
                    serviceAccount: default
                    serviceAccountName: default
                    terminationGracePeriodSeconds: 30
                    tolerations:
                    - effect: NoExecute
                      key: node.kubernetes.io/not-ready
                      operator: Exists
                      tolerationSeconds: 300
                    - effect: NoExecute
                      key: node.kubernetes.io/unreachable
                      operator: Exists
                      tolerationSeconds: 300
                    volumes:
                    - name: default-token-m962m
                      secret:
                        secretName: default-token-m962m
                  status: {}
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/nginx-6b676d6776
Containers:
  nginx:
    Image:      nginx
    Port:       80/TCP
    Host Port:  0/TCP
    Requests:
      cpu:        100m
      memory:     32Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-m962m (ro)
Volumes:
  default-token-m962m:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-m962m
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/network-unavailable
                 virtual-kubelet.io/provider=admiralty
Events:          <none>

kubectl describe pod for a terminating pod:

❯ kubectl describe pod nginx-6b676d6776-rwpn5
Name:                      nginx-6b676d6776-rwpn5
Namespace:                 default
Priority:                  0
Node:                      <none>
Labels:                    app=nginx
                           multicluster.admiralty.io/has-finalizer=true
                           pod-template-hash=6b676d6776
Annotations:               multicluster.admiralty.io/elect:
                           multicluster.admiralty.io/sourcepod-manifest:
                             apiVersion: v1
                             kind: Pod
                             metadata:
                               annotations:
                                 multicluster.admiralty.io/elect: ""
                               creationTimestamp: null
                               generateName: nginx-6b676d6776-
                               labels:
                                 app: nginx
                                 pod-template-hash: 6b676d6776
                               namespace: default
                               ownerReferences:
                               - apiVersion: apps/v1
                                 blockOwnerDeletion: true
                                 controller: true
                                 kind: ReplicaSet
                                 name: nginx-6b676d6776
                                 uid: 9766036b-d883-11ea-bff2-fa163e6c431e
                             spec:
                               containers:
                               - image: nginx
                                 imagePullPolicy: Always
                                 name: nginx
                                 ports:
                                 - containerPort: 80
                                   protocol: TCP
                                 resources:
                                   requests:
                                     cpu: 100m
                                     memory: 32Mi
                                 terminationMessagePath: /dev/termination-log
                                 terminationMessagePolicy: File
                                 volumeMounts:
                                 - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
                                   name: default-token-m962m
                                   readOnly: true
                               dnsPolicy: ClusterFirst
                               enableServiceLinks: true
                               priority: 0
                               restartPolicy: Always
                               schedulerName: default-scheduler
                               securityContext: {}
                               serviceAccount: default
                               serviceAccountName: default
                               terminationGracePeriodSeconds: 30
                               tolerations:
                               - effect: NoExecute
                                 key: node.kubernetes.io/not-ready
                                 operator: Exists
                                 tolerationSeconds: 300
                               - effect: NoExecute
                                 key: node.kubernetes.io/unreachable
                                 operator: Exists
                                 tolerationSeconds: 300
                               volumes:
                               - name: default-token-m962m
                                 secret:
                                   secretName: default-token-m962m
                             status: {}
Status:                    Terminating (lasts 63m)
Termination Grace Period:  0s
IP:
IPs:                       <none>
Controlled By:             ReplicaSet/nginx-6b676d6776
Containers:
  nginx:
    Image:      nginx
    Port:       80/TCP
    Host Port:  0/TCP
    Requests:
      cpu:        100m
      memory:     32Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-m962m (ro)
Volumes:
  default-token-m962m:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-m962m
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/network-unavailable
                 virtual-kubelet.io/provider=admiralty
Events:          <none>

Btw, kubectl describe node says both nodes/clusters have no resources. Is this expected?

❯ kubectl describe node admiralty-earth
Name:               admiralty-earth
Roles:              cluster
Labels:             alpha.service-controller.kubernetes.io/exclude-balancer=true
                    kubernetes.io/role=cluster
                    type=virtual-kubelet
                    virtual-kubelet.io/provider=admiralty
Annotations:        node.alpha.kubernetes.io/ttl: 0
CreationTimestamp:  Fri, 07 Aug 2020 13:44:47 +0800
Taints:             virtual-kubelet.io/provider=admiralty:NoSchedule
Unschedulable:      false
Conditions:
  Type    Status  LastHeartbeatTime                 LastTransitionTime                Reason  Message
  ----    ------  -----------------                 ------------------                ------  -------
  Ready   True    Fri, 07 Aug 2020 17:27:27 +0800   Fri, 07 Aug 2020 14:44:08 +0800
Addresses:
System Info:
 Machine ID:
 System UUID:
 Boot ID:
 Kernel Version:
 OS Image:
 Operating System:
 Architecture:
 Container Runtime Version:
 Kubelet Version:
 Kube-Proxy Version:
PodCIDR:                     10.244.3.0/24
Non-terminated Pods:         (0 in total)
  Namespace                  Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
Events:              <none>

❯ kubectl describe node admiralty-moon
Name:               admiralty-moon
Roles:              cluster
Labels:             alpha.service-controller.kubernetes.io/exclude-balancer=true
                    kubernetes.io/role=cluster
                    type=virtual-kubelet
                    virtual-kubelet.io/provider=admiralty
Annotations:        node.alpha.kubernetes.io/ttl: 0
CreationTimestamp:  Fri, 07 Aug 2020 14:44:08 +0800
Taints:             virtual-kubelet.io/provider=admiralty:NoSchedule
Unschedulable:      false
Conditions:
  Type    Status  LastHeartbeatTime                 LastTransitionTime                Reason  Message
  ----    ------  -----------------                 ------------------                ------  -------
  Ready   True    Fri, 07 Aug 2020 17:27:36 +0800   Fri, 07 Aug 2020 14:44:08 +0800
Addresses:
System Info:
 Machine ID:
 System UUID:
 Boot ID:
 Kernel Version:
 OS Image:
 Operating System:
 Architecture:
 Container Runtime Version:
 Kubelet Version:
 Kube-Proxy Version:
PodCIDR:                     10.244.4.0/24
Non-terminated Pods:         (0 in total)
  Namespace                  Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
Events:              <none>

Many thanks!

multicluster-scheduler-agent pod in status ContainerCreating

I checked the details and found the reason: the secret "remote" was not found.

Events:
  Type     Reason       Age                From               Message
  ----     ------       ---                ----               -------
  Normal   Scheduled    46s                default-scheduler  Successfully assigned ms/multicluster-scheduler-agent-7d65d8b586-jlrbb to node1
  Warning  FailedMount  14s (x7 over 45s)  kubelet, node1     MountVolume.SetUp failed for volume "remote" : secret "remote" not found
