admiraltyio/admiralty
A system of Kubernetes controllers that intelligently schedules workloads across clusters.
Home Page: https://admiralty.io
License: Apache License 2.0
Feature request: propagate events from pods running remotely to the proxy pod, so that users can monitor all pods without logging in to remote clusters, and can see from the local proxy pod whether a remote pod is having some kind of error.
The way names for controlled objects are formatted can result in collisions and length errors.
For example, virtual nodes for namespaced targets are named using this format: fmt.Sprintf("admiralty-namespace-%s-%s", target.Namespace, target.Name), which would result in the same name for target baz in the foo-bar namespace and target bar-baz in the foo namespace. A common way to fix this is to append a hash of fmt.Sprintf("%s/%s", target.Namespace, target.Name) at the end, which cannot collide because slashes aren't allowed in namespaces and names.
Furthermore, if the name of the target is very long, the resulting virtual node name could exceed the limit of 253 characters. A common solution is to truncate the result and keep some room for a hash (to avoid prefix collisions).
Source role bindings are also affected.
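The hash-suffix-plus-truncation fix can be sketched in Go. The function name, the 10-character hash length, and the use of SHA-256 are illustrative choices, not Admiralty's actual implementation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// virtualNodeName builds a collision-free name for a namespaced target's
// virtual node. A short hash of "namespace/name" disambiguates pairs like
// foo-bar/baz vs. foo/bar-baz, and the readable prefix is truncated so the
// whole name stays within the 253-character limit.
func virtualNodeName(namespace, name string) string {
	const maxLen = 253
	const hashLen = 10
	hash := fmt.Sprintf("%x", sha256.Sum256([]byte(fmt.Sprintf("%s/%s", namespace, name))))[:hashLen]
	prefix := fmt.Sprintf("admiralty-namespace-%s-%s", namespace, name)
	if len(prefix) > maxLen-1-hashLen {
		prefix = prefix[:maxLen-1-hashLen]
	}
	return prefix + "-" + hash
}

func main() {
	// Distinct names despite identical naive concatenation:
	fmt.Println(virtualNodeName("foo-bar", "baz"))
	fmt.Println(virtualNodeName("foo", "bar-baz"))
}
```

Hashing the slash-separated pair is what guarantees uniqueness: slashes can't appear in namespaces or names, so two distinct targets can never produce the same hash input.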
In a cluster with 120 nodes and 2,600 pods, the agent pod is OOM-killed before it can send any observation. Increasing the default memory limit fixes the issue, but we need to provide guidance on how much memory is needed as a function of node and pod cardinality.
Add a NOTES.txt to the chart to guide the user after helm install (https://helm.sh/docs/chart_template_guide/notes_files/).
List next steps (configure cross-cluster authentication, multi-cluster scheduling), and give a link to the docs.
reported by @dimm0
The pod chaperon controller in the target cluster fails with error syncing 'NAMESPACE/POD_NAME': Pod "POD_NAME" is invalid: spec.initContainers[0].volumeMounts[1].name: Not found: "default-token-gnnn6", requeuing. In turn, the candidate scheduler times out, and then the proxy scheduler times out.
Service account token volume mounts are supposed to be removed from pod specs when pod chaperons are created, to avoid this issue: service accounts and their tokens differ between clusters, and delegate pods use service accounts found in their target clusters.
The bug is that those volume mounts aren't removed from init containers.
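A minimal sketch of the fix, using simplified stand-ins for the corev1 container types (type and function names here are illustrative): the same stripping pass that already runs over spec.containers must also run over spec.initContainers.

```go
package main

import "fmt"

// Simplified stand-ins for corev1.Container and corev1.VolumeMount.
type VolumeMount struct{ Name string }
type Container struct {
	Name         string
	VolumeMounts []VolumeMount
}

// stripTokenMounts removes mounts of the given service account token volume
// in place. Service accounts and their tokens differ between clusters, so
// these mounts must not be copied into pod chaperons.
func stripTokenMounts(containers []Container, tokenVolume string) {
	for i := range containers {
		kept := containers[i].VolumeMounts[:0]
		for _, m := range containers[i].VolumeMounts {
			if m.Name != tokenVolume {
				kept = append(kept, m)
			}
		}
		containers[i].VolumeMounts = kept
	}
}

func main() {
	initContainers := []Container{{Name: "init", VolumeMounts: []VolumeMount{
		{Name: "data"}, {Name: "default-token-gnnn6"},
	}}}
	// The bug: this pass ran over spec.containers only. It must also run here:
	stripTokenMounts(initContainers, "default-token-gnnn6")
	fmt.Println(initContainers[0].VolumeMounts) // only the "data" mount remains
}
```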
(⎈ |nautilus:myazdani)➜ ~ kubectl apply -f https://github.com/admiraltyio/multicluster-scheduler/releases/download/v0.2.0/scheduler.yaml
namespace/multicluster-scheduler created
error: error validating "https://github.com/admiraltyio/multicluster-scheduler/releases/download/v0.2.0/scheduler.yaml": error validating data: [ValidationError(CustomResourceDefinition.status): missing required field "conditions" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1beta1.CustomResourceDefinitionStatus, ValidationError(CustomResourceDefinition.status): missing required field "storedVersions" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1beta1.CustomResourceDefinitionStatus]; if you choose to ignore these errors, turn validation off with --validate=false
k8s v1.13.3
In some Kubernetes distributions, e.g., GKE routes-based clusters, the Kubernetes route controller tries to connect to our virtual-kubelet nodes to create network routes, but can't. It then updates the status of those nodes with a "network unavailable" condition. With the TaintNodesByCondition feature enabled (GA in 1.17, gate removed in 1.18), the condition is turned into a taint that prevents pods that don't tolerate it from being scheduled to those nodes.
This affects our proxy pods. Unfortunately, at the moment, the user cannot add local tolerations to proxy pods: any toleration added to a source pod is saved to be applied to the corresponding candidate/delegate pods instead, and overwritten at admission by hard-coded tolerations (to tolerate the virtual-kubelet.io/provider=admiralty taint).
Workaround: note that this issue doesn't affect VPC-based clusters in GKE.
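A possible shape for a fix, sketched with illustrative types (this is not Admiralty's admission code): merge the hard-coded proxy-pod tolerations with the user's instead of overwriting them, so a toleration for node.kubernetes.io/network-unavailable could survive admission.

```go
package main

import "fmt"

// Toleration is a simplified stand-in for corev1.Toleration.
type Toleration struct {
	Key, Value, Effect string
}

// mergeTolerations appends the hard-coded proxy-pod tolerations to any
// user-supplied ones instead of replacing them, deduplicating exact matches.
func mergeTolerations(user []Toleration) []Toleration {
	hardcoded := []Toleration{
		{Key: "virtual-kubelet.io/provider", Value: "admiralty", Effect: "NoSchedule"},
	}
	out := append([]Toleration{}, user...)
	for _, h := range hardcoded {
		dup := false
		for _, u := range out {
			if u == h {
				dup = true
				break
			}
		}
		if !dup {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	// A user toleration for routes-based GKE clusters survives the merge:
	user := []Toleration{{Key: "node.kubernetes.io/network-unavailable", Effect: "NoSchedule"}}
	for _, t := range mergeTolerations(user) {
		fmt.Println(t.Key)
	}
}
```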
cc @verchol
I checked the details and found the reason: secret "remote" was not found.
Events:
Type Reason Age From Message
Normal Scheduled 46s default-scheduler Successfully assigned ms/multicluster-scheduler-agent-7d65d8b586-jlrbb to node1
Warning FailedMount 14s (x7 over 45s) kubelet, node1 MountVolume.SetUp failed for volume "remote" : secret "remote" not found
The agent logs show:
Federation namespace: multifederation
controller.go:160: cannot update *v1alpha1.NodePoolObservation eks-cluster-default in namespace multifederation in scheduler cluster: Operation cannot be fulfilled on nodepoolobservations.multicluster.admiralty.io "eks-cluster-default": the object has been modified; please apply your changes to the latest version and try again
controller.go:161: Could not reconcile Request. Stop working.
I'm trying to write my own scheduler with the Kubernetes Python client to schedule pods. These pods are already submitted to the cluster and select my own scheduler, so they are pending in the cluster. I patch the annotations
{"multicluster.admiraly.io/clustername":str_cluster,'multicluster.admiralty.io/elect':''}
onto the pod and expect the pod to be scheduled to <str_cluster>, but it fails to schedule.
Then I check the pod yaml (not complete).
The result of kubectl describe pod nginx-4
is
But the pod status is still pending, and the virtual pod doesn't appear.
Is multicluster-scheduler able to schedule already-pending pods? How can I make this work?
Thanks.
Some Kubernetes distributions enable DNS horizontal autoscaling by default, e.g., a kube-dns-autoscaler deployment running cluster-proportional-autoscaler autoscales the kube-dns deployment in GKE. The scale depends on the number of nodes and the number of CPU cores (a rough indication of the load on system components, assuming those nodes/cores aren't idle).
Here is a common configuration: linear: '{"coresPerReplica":256,"min":1,"nodesPerReplica":16}'.
Our virtual-kubelet nodes have a dummy, very high CPU capacity of 100,000 cores, because the candidate scheduler in the corresponding target cluster actually enforces capacity constraints, based on the nodes there. We only need the virtual-kubelet nodes' capacities to be equal to or greater than the actual capacities of the clusters that they represent, so a quick and good-enough solution was to set those numbers very high.
However, with DNS horizontal autoscaling, after multicluster-scheduler is installed, hundreds of kube-dns replicas are added, overloading kube-scheduler and starving the source cluster. This issue can be fixed at two levels: in the autoscaler, by excluding virtual nodes from the count with its --nodelabels option (AKS does that for ACI nodes, and EKS for Fargate nodes); or in multicluster-scheduler, by advertising less extreme virtual-node capacities.
cc @verchol
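To quantify the problem, here is the "linear" formula reimplemented for illustration (the real logic lives in cluster-proportional-autoscaler; this sketch ignores its optional max and preventSinglePointFailure parameters):

```go
package main

import (
	"fmt"
	"math"
)

// linearReplicas mirrors cluster-proportional-autoscaler's "linear" mode:
// replicas = max(ceil(cores/coresPerReplica), ceil(nodes/nodesPerReplica)),
// floored at min.
func linearReplicas(cores, nodes, coresPerReplica, nodesPerReplica, min int) int {
	r := int(math.Max(
		math.Ceil(float64(cores)/float64(coresPerReplica)),
		math.Ceil(float64(nodes)/float64(nodesPerReplica)),
	))
	if r < min {
		r = min
	}
	return r
}

func main() {
	// 10 real nodes, 40 real cores: 1 kube-dns replica.
	fmt.Println(linearReplicas(40, 10, 256, 16, 1)) // 1
	// Add one virtual node advertising 100,000 cores: ~391 replicas.
	fmt.Println(linearReplicas(100040, 11, 256, 16, 1)) // 391
}
```

With the configuration above, a single virtual node advertising 100,000 cores pushes the cores term to ceil(100040/256) = 391 replicas, which matches the "hundreds of kube-dns replicas" observed.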
reported by @dimm0
The Helm chart's documentation and default values file suggest that resources for the controller manager deployment can be set using the agent.controllerManager.resources value, but the template actually uses the resources value (at the root level). That's a bug.
Scripts already exist. Just automate them on pushes and PRs to master with e.g. GitHub Actions.
Upgrade should be smooth: https://cert-manager.io/docs/release-notes/release-notes-1.0/
cc @jjgraham
Currently we only release amd64 images. We've released arm64 images unofficially (with suffixed tags, e.g., 0.8.0-arm64), but what we should really do is publish a multi-arch manifest, so that the normal version tag, e.g., 0.8.0, would pull an arm64 image on an arm64 node and an amd64 image on an amd64 node.
Theoretically, given all the PodObservations and NodeObservations, multicluster-scheduler should have enough information to decide which node in a target cluster a delegate pod should be bound to. However, for now, it only decides which cluster, and leaves it up to that cluster's default (or custom) single-cluster scheduler to bind the delegate pod to a node.
The current design allows the user to leverage any single-cluster scheduler's advanced scheduling features. However, there is no guarantee that the target cluster will be able to schedule the delegate pod, because the multi-cluster decision didn't account for all of the single-cluster constraints. Also, the time from multi-cluster pod creation to running state could be reduced by combining the two scheduling steps.
Multicluster-scheduler should account for all constraints and directly schedule to a node, to increase reliability and performance. This means that multicluster-controller should use the default Kubernetes scheduler and also kube-batch as libraries.
Note: working with SIG-Scheduling to define common interfaces would help.
Reported by @tghartland, who noted that force deleting works.
Looks like the last status of delegate pods isn't fed back to proxy pods, so even though proxy pods have a deletion timestamp and no more finalizer, their status shows that all containers are still running, so they can't be deleted gracefully.
This is a regression: before v0.6.0 and the adoption of virtual-kubelet, proxy pods were deleted by sending a signal to their dummy containers.
Hello,
I tried to deploy admiralty on two GKE clusters.
During virtual kubelet creation on the source cluster, I get the following error in the node description:
Warning FailedToCreateRoute 25m route_controller Could not create route 0d3a65bd-97b9-4959-aa67-99874ab6317e 10.116.2.0/24 for node admiralty-plm-c7ff8eb6-target-cluster-3e3933bad4 after 233.845278ms: instance not found
The node is tainted as following:
Taints: node.kubernetes.io/network-unavailable:NoSchedule
virtual-kubelet.io/provider=admiralty:NoSchedule
The pods deployed in a namespace with the multicluster.admiralty.io/elect: "" annotation set stay in Pending state.
Is GKE a supported target for multicluster-scheduler deployment?
Thanks
Hi guys,
TL;DR
Three questions:
Besides kubectl get pods -n admiralty and kubectl get nodes -o wide, is there any way to ensure all the clusters are set up and connected correctly?
Here's the lengthy context 😢:
I'm a newcomer to multicluster-scheduler, and I want to deploy Argo Workflows across clusters while avoiding changing the target cluster too much.
So I started with the multicluster-scheduler installation guide and followed the instructions twice (the second time installing only on the source cluster, since the "If you can only access one of the two clusters" section says it's okay not to install on the target cluster).
Both attempts looked great after the installation part: kubectl get nodes -o wide returns the two virtual cluster nodes, and kubectl get pods -n admiralty shows all 3 pods on the source cluster running.
Here are the details after installing cert-manager and multicluster-scheduler on both clusters:
❯ export CLUSTER1=dev-earth
❯ export CLUSTER2=dev-moon
❯ kubectl --context "$CLUSTER1" get pods -n admiralty
NAME READY STATUS RESTARTS AGE
multicluster-scheduler-candidate-scheduler-79f956995c-9kvbz 1/1 Running 0 125m
multicluster-scheduler-controller-manager-6b694fc496-vvhlw 1/1 Running 0 125m
multicluster-scheduler-proxy-scheduler-59bb6c778-45x54 1/1 Running 0 125m
❯ kubectl --context "$CLUSTER1" get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
admiralty-earth Ready cluster 3h5m <none> <none> <unknown> <unknown> <unknown>
admiralty-moon Ready cluster 125m <none> <none> <unknown> <unknown> <unknown>
host-10-198-21-72-10.198.21.72 Ready <none> 24d v1.14.1 10.198.21.72 <none> Ubuntu 16.04.1 LTS 4.4.0-130-generic docker://18.9.9
host-10-198-22-176-10.198.22.176 Ready <none> 24d v1.14.1 10.198.22.176 <none> Ubuntu 16.04.1 LTS 4.4.0-130-generic docker://18.9.9
host-10-198-23-129-10.198.23.129 Ready master 24d v1.14.1 10.198.23.129 <none> Ubuntu 16.04.1 LTS 4.4.0-130-generic docker://18.9.9
❯ kubectl --context "$CLUSTER2" get pods -n admiralty
NAME READY STATUS RESTARTS AGE
multicluster-scheduler-candidate-scheduler-6cff566db6-77l9h 1/1 Running 0 59m
multicluster-scheduler-controller-manager-d857466cd-bvxht 1/1 Running 1 59m
multicluster-scheduler-proxy-scheduler-7d8bb6666d-568tc 1/1 Running 0 59m
❯ kubectl --context "$CLUSTER2" get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
bj-idc-10-10-18-73 NotReady <none> 15d v1.15.9 10.10.18.73 <none> Ubuntu 16.04.6 LTS 4.15.0-45-generic docker://18.9.7
bj-idc-10-10-18-74 Ready master 15d v1.15.9 10.10.18.74 <none> Ubuntu 16.04.1 LTS 4.4.0-184-generic docker://19.3.8
But the nginx demo never works: I always get 10 pending pods on the source cluster, and no virtual pods, nor any pods on the target cluster.
Here's the kubectl get pods output (I ran the demo three times; Pending pods turned into everlasting Terminating pods after forced deletion with kubectl delete pods nginx-6b676d6776-XXX --grace-period=0 --force):
❯ kb get pods
NAME READY STATUS RESTARTS AGE
camera-pose-service-6c79665bbb-lvcsv 1/1 Running 0 3d22h
counter 1/1 Running 0 20h
importing-agent-754b64c55b-xhnsx 1/1 Running 0 17d
myapp 1/1 Running 0 3d22h
nginx-6b676d6776-25ztl 0/1 Terminating 0 96m
nginx-6b676d6776-2ph7q 0/1 Terminating 0 96m
nginx-6b676d6776-6b2qb 0/1 Terminating 0 28m
nginx-6b676d6776-7bssr 0/1 Pending 0 14m
nginx-6b676d6776-7jsgt 0/1 Terminating 0 28m
nginx-6b676d6776-7z5c5 0/1 Pending 0 14m
nginx-6b676d6776-9zrnm 0/1 Pending 0 14m
nginx-6b676d6776-bw4ck 0/1 Pending 0 14m
nginx-6b676d6776-cvdkx 0/1 Terminating 0 28m
nginx-6b676d6776-ghkwk 0/1 Terminating 0 28m
nginx-6b676d6776-hv9nz 0/1 Pending 0 14m
nginx-6b676d6776-kk5ng 0/1 Terminating 0 96m
nginx-6b676d6776-kwtx5 0/1 Terminating 0 28m
nginx-6b676d6776-lbqjv 0/1 Terminating 0 96m
nginx-6b676d6776-mc2nr 0/1 Terminating 0 28m
nginx-6b676d6776-mnmlr 0/1 Terminating 0 96m
nginx-6b676d6776-n7p47 0/1 Pending 0 14m
nginx-6b676d6776-p4pkj 0/1 Pending 0 14m
nginx-6b676d6776-pql85 0/1 Terminating 0 96m
nginx-6b676d6776-q89px 0/1 Terminating 0 96m
nginx-6b676d6776-r7frt 0/1 Terminating 0 28m
nginx-6b676d6776-rfv9d 0/1 Terminating 0 96m
nginx-6b676d6776-rwpn5 0/1 Terminating 0 28m
nginx-6b676d6776-trt64 0/1 Pending 0 14m
nginx-6b676d6776-txdvs 0/1 Terminating 0 96m
nginx-6b676d6776-tzb9l 0/1 Terminating 0 28m
nginx-6b676d6776-v9swj 0/1 Pending 0 14m
nginx-6b676d6776-xbcx4 0/1 Terminating 0 96m
nginx-6b676d6776-z9qlr 0/1 Terminating 0 28m
nginx-6b676d6776-zzjmx 0/1 Pending 0 14m
pose-wrapper-6446446694-hbvtj 1/1 Running 0 3d22h
sfd-init-minio-bucket-job-svfxq 0/1 Completed 0 23d
simple-feature-db-proxy-6c795967d-d5xkx 1/1 Running 0 3d6h
simple-feature-db-set-0-worker-0 1/1 Running 0 3d6h
kubectl describe pod for a pending pod:
❯ kubectl describe pod nginx-6b676d6776-7bssr
Name: nginx-6b676d6776-7bssr
Namespace: default
Priority: 0
Node: <none>
Labels: app=nginx
multicluster.admiralty.io/has-finalizer=true
pod-template-hash=6b676d6776
Annotations: multicluster.admiralty.io/elect:
multicluster.admiralty.io/sourcepod-manifest:
apiVersion: v1
kind: Pod
metadata:
annotations:
multicluster.admiralty.io/elect: ""
creationTimestamp: null
generateName: nginx-6b676d6776-
labels:
app: nginx
pod-template-hash: 6b676d6776
namespace: default
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: nginx-6b676d6776
uid: 83c74b66-d885-11ea-bff2-fa163e6c431e
spec:
containers:
- image: nginx
imagePullPolicy: Always
name: nginx
ports:
- containerPort: 80
protocol: TCP
resources:
requests:
cpu: 100m
memory: 32Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-m962m
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-m962m
secret:
secretName: default-token-m962m
status: {}
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/nginx-6b676d6776
Containers:
nginx:
Image: nginx
Port: 80/TCP
Host Port: 0/TCP
Requests:
cpu: 100m
memory: 32Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-m962m (ro)
Volumes:
default-token-m962m:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-m962m
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/network-unavailable
virtual-kubelet.io/provider=admiralty
Events: <none>
kubectl describe pod for a terminating pod:
❯ kubectl describe pod nginx-6b676d6776-rwpn5
Name: nginx-6b676d6776-rwpn5
Namespace: default
Priority: 0
Node: <none>
Labels: app=nginx
multicluster.admiralty.io/has-finalizer=true
pod-template-hash=6b676d6776
Annotations: multicluster.admiralty.io/elect:
multicluster.admiralty.io/sourcepod-manifest:
apiVersion: v1
kind: Pod
metadata:
annotations:
multicluster.admiralty.io/elect: ""
creationTimestamp: null
generateName: nginx-6b676d6776-
labels:
app: nginx
pod-template-hash: 6b676d6776
namespace: default
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: nginx-6b676d6776
uid: 9766036b-d883-11ea-bff2-fa163e6c431e
spec:
containers:
- image: nginx
imagePullPolicy: Always
name: nginx
ports:
- containerPort: 80
protocol: TCP
resources:
requests:
cpu: 100m
memory: 32Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-m962m
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-m962m
secret:
secretName: default-token-m962m
status: {}
Status: Terminating (lasts 63m)
Termination Grace Period: 0s
IP:
IPs: <none>
Controlled By: ReplicaSet/nginx-6b676d6776
Containers:
nginx:
Image: nginx
Port: 80/TCP
Host Port: 0/TCP
Requests:
cpu: 100m
memory: 32Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-m962m (ro)
Volumes:
default-token-m962m:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-m962m
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/network-unavailable
virtual-kubelet.io/provider=admiralty
Events: <none>
Btw, kubectl describe node says both nodes/clusters have no resources. Is this expected?
❯ kubectl describe node admiralty-earth
Name: admiralty-earth
Roles: cluster
Labels: alpha.service-controller.kubernetes.io/exclude-balancer=true
kubernetes.io/role=cluster
type=virtual-kubelet
virtual-kubelet.io/provider=admiralty
Annotations: node.alpha.kubernetes.io/ttl: 0
CreationTimestamp: Fri, 07 Aug 2020 13:44:47 +0800
Taints: virtual-kubelet.io/provider=admiralty:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
Ready True Fri, 07 Aug 2020 17:27:27 +0800 Fri, 07 Aug 2020 14:44:08 +0800
Addresses:
System Info:
Machine ID:
System UUID:
Boot ID:
Kernel Version:
OS Image:
Operating System:
Architecture:
Container Runtime Version:
Kubelet Version:
Kube-Proxy Version:
PodCIDR: 10.244.3.0/24
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 0 (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
❯ kubectl describe node admiralty-moon
Name: admiralty-moon
Roles: cluster
Labels: alpha.service-controller.kubernetes.io/exclude-balancer=true
kubernetes.io/role=cluster
type=virtual-kubelet
virtual-kubelet.io/provider=admiralty
Annotations: node.alpha.kubernetes.io/ttl: 0
CreationTimestamp: Fri, 07 Aug 2020 14:44:08 +0800
Taints: virtual-kubelet.io/provider=admiralty:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
Ready True Fri, 07 Aug 2020 17:27:36 +0800 Fri, 07 Aug 2020 14:44:08 +0800
Addresses:
System Info:
Machine ID:
System UUID:
Boot ID:
Kernel Version:
OS Image:
Operating System:
Architecture:
Container Runtime Version:
Kubelet Version:
Kube-Proxy Version:
PodCIDR: 10.244.4.0/24
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 0 (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
Many thanks!
This is a prerequisite for integrating with other projects in the Kubernetes ecosystem that have already upgraded. With client-go's major breaking change, we all need to get on the same page.
Here's what the upgrade involves:
The feedback controller tries to watch remote pod chaperons at the cluster level even if the target is configured to be namespaced, which fails if not allowed. The controller should use a namespaced shared informer factory.
Followed the README.md for setting up multicluster-scheduler with two kind clusters (cilium 1.8.1 as their CNI, no cluster mesh). Virtual nodes have been created in cluster1, but the cilium pods stay in Pending status... how do I solve this problem?
PS: everything is OK when using kind's default CNI ("kindnetd").
# kubectl --context "$CLUSTER1" get node
NAME STATUS ROLES AGE VERSION
admiralty-c1 Ready cluster 21m
admiralty-c2 Ready cluster 21m
cluster1-control-plane Ready master 85m v1.18.2
cluster1-worker Ready <none> 84m v1.18.2
# kubectl --context "$CLUSTER2" get node
NAME STATUS ROLES AGE VERSION
cluster2-control-plane Ready master 83m v1.18.2
cluster2-worker Ready <none> 83m v1.18.2
# kubectl --context "$CLUSTER1" get node -l virtual-kubelet.io/provider=admiralty
NAME STATUS ROLES AGE VERSION
admiralty-c1 Ready cluster 20m
admiralty-c2 Ready cluster 20m
Pending pods:
# kubectl --context "$CLUSTER1" get pods --all-namespaces -o wide | grep -i pending
kube-system cilium-6mrvl 0/1 Pending 0 22m <none> admiralty-c1 <none> <none>
kube-system cilium-node-init-7jsvq 0/1 Pending 0 22m <none> admiralty-c2 <none> <none>
kube-system cilium-node-init-s89xs 0/1 Pending 0 22m <none> admiralty-c1 <none> <none>
kube-system cilium-vsnnh 0/1 Pending 0 22m <none> admiralty-c2 <none> <none>
# kubectl --context "$CLUSTER1" describe po cilium-6mrvl -n kube-system
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 32m default-scheduler 0/4 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 Too many pods, 3 node(s) didn't match node selector.
Normal Scheduled 32m default-scheduler Successfully assigned kube-system/cilium-6mrvl to admiralty-c1
More specifically, they should follow services that follow pods.
One use case is for integration with Admiralty Cloud's DNS-based global server load balancing (GSLB) solution.
Having issues with scheduling pods on any cluster: the delegate pods always get created in the scheduler cluster, even with the annotation set to the other cluster.
Setup: 2 clusters created on bare metal with cilium running (no cluster mesh yet). c1 is the scheduler. I deploy pods on c2 and get delegate pods on c1. Great! But it still delegates to c1 with the annotation explicitly set to c2. If I try the other way around and deploy pods on c1, it also delegates to c1 even with the annotation set.
I want to fix this and at least be able to get delegate pods on c2 when I deploy on c1. Any suggestions? Thanks!
kubernetes version: 1.6.0
cilium version: 1.6.3
deploy pods on c2
expected: yes
$ kubectl --context c2 get pods
NAME READY STATUS RESTARTS AGE
nginx-69c855d8f5-hp54d 1/1 Running 0 10m
nginx-69c855d8f5-lb578 1/1 Running 0 10m
nginx-69c855d8f5-zb5d6 1/1 Running 0 10m
ubuntu-56978cc9cb-7ql2q 1/1 Running 0 19s
$ kubectl --context c1 get pods
NAME READY STATUS RESTARTS AGE
c2-default-nginx-69c855d8f5-hp54d 1/1 Running 0 10m
c2-default-nginx-69c855d8f5-lb578 1/1 Running 0 10m
c2-default-nginx-69c855d8f5-zb5d6 1/1 Running 0 10m
c2-default-ubuntu-56978cc9cb-7ql2q 1/1 Running 0 71s
deploy pods on c1
expected: no, the delegate pods c1-default-nginx... should be created at c2
$ kubectl --context c1 get pods --watch
NAME READY STATUS RESTARTS AGE
c1-default-nginx-956d89fb-5w4bb 1/1 Running 0 17s
c1-default-nginx-956d89fb-gsvv6 1/1 Running 0 17s
c1-default-nginx-956d89fb-wx7fh 1/1 Running 0 17s
c2-default-ubuntu-56978cc9cb-7ql2q 1/1 Running 1 46h
nginx-956d89fb-5w4bb 1/1 Running 0 18s
nginx-956d89fb-gsvv6 1/1 Running 0 18s
nginx-956d89fb-wx7fh 1/1 Running 0 17s
I built clusters and configured them as the documentation says. The proxy pods sync status as expected, but their logs do not. I am not sure if this is in line with expectations.
My environment variables are shown below:
export CLUSTER1=cluster1-admin@cluster1
export CLUSTER2=cluster2-admin@cluster2
export NAMESPACE=default
I applied a deployment that specifies running the pod on $CLUSTER2:
# remote-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: say
spec:
replicas: 1
selector:
matchLabels:
app: say
template:
metadata:
labels:
app: say
annotations:
multicluster.admiralty.io/elect: ""
multicluster.admiralty.io/clustername: cluster2-admin@cluster2
spec:
containers:
- name: say
image: docker/whalesay:latest
command: [cowsay]
args: ["hello world"]
After it finished running, I printed the logs of the pod in $CLUSTER2. They showed as expected.
$ kubectl --context $CLUSTER2 logs say-6895d46c5c-pllx4-84wrs
_____________
< hello world >
-------------
\
\
\
## .
## ## ## ==
## ## ## ## ===
/""""""""""""""""___/ ===
~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ / ===- ~~~
\______ o __/
\ \ __/
\____\______/
But the logs of the pod in $CLUSTER1 are different:
$ kubectl --context $CLUSTER1 logs say-6895d46c5c-pllx4
Error from server: no preferred addresses found; known addresses: []
Service rerouting transforms a selector like:
selector:
a: b
into
selector:
multicluster.admiralty.io/a: b
to match delegate pods instead of proxy pods.
However, when the service is updated, or the original configuration is re-applied by the user, the selector becomes:
selector:
multicluster.admiralty.io/a: b
a: b
Service rerouting should modify the selector again to make it look like the previous snippet. The controller's shouldReroute function looks at the service's Endpoints object to see if it targets proxy pods, but at that time, it doesn't match anything.
Proposed solution: don't use the Endpoints; instead, list pods directly with a filtered selector (keys not prefixed with our domain), or, when the service is already rerouted (i.e., the selector contains only keys prefixed with our domain), use the original selector saved as an annotation.
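The rewrite underlying the proposed solution needs to be idempotent. A sketch (illustrative code, not the controller's implementation): keys already carrying our domain prefix pass through, and unprefixed keys re-added by the user are re-prefixed, merging back into the rerouted form.

```go
package main

import (
	"fmt"
	"strings"
)

const domainPrefix = "multicluster.admiralty.io/"

// rerouteSelector returns a selector whose keys all carry the Admiralty
// domain prefix. Applying it twice, or applying it after a user re-applied
// the original unprefixed selector, yields the same result.
func rerouteSelector(sel map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range sel {
		if strings.HasPrefix(k, domainPrefix) {
			out[k] = v // already rerouted
		} else {
			out[domainPrefix+k] = v
		}
	}
	return out
}

func main() {
	// A user re-applied {"a": "b"} on top of the rerouted selector;
	// the transform collapses both entries back into one prefixed key.
	fmt.Println(rerouteSelector(map[string]string{"a": "b", domainPrefix + "a": "b"}))
}
```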
I'm finding that the Helm setup used for the latest release (v0.6.0) takes two passes of helm install multicluster-scheduler [...] to work properly. On the first pass of e.g.
helm install multicluster-scheduler admiralty/multicluster-scheduler \
--kube-context $CLUSTER2 \
--set agent.enabled=true \
--set agent.clusterName=c2
the following error message is received:
Error: Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: dial tcp 10.233.54.131:443: connect: connection refused
and the multicluster-scheduler-agent-[...] pod is stuck in the ContainerCreating state with the following error, even after exporting the remote secret via ./kubemcsa export --context $CLUSTER1 c1 --as remote | kubectl --context $CLUSTER1 apply -f - and ./kubemcsa export --context $CLUSTER1 c2 --as remote | kubectl --context $CLUSTER2 apply -f -:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m7s default-scheduler Successfully assigned default/multicluster-scheduler-agent-9fddf8bf-85ddk to node3
Warning FailedMount 2m59s (x5 over 3m7s) kubelet, node3 MountVolume.SetUp failed for volume "remote" : secret "remote" not found
Warning FailedMount 64s kubelet, node3 Unable to mount volumes for pod "multicluster-scheduler-agent-9fddf8bf-85ddk_default(4160c035-35ec-11ea-a853-fa163ec400e7)": timeout expired waiting for volumes to attach or mount for pod "default"/"multicluster-scheduler-agent-9fddf8bf-85ddk". list of unmounted volumes=[cert]. list of unattached volumes=[cert config remote multicluster-scheduler-agent-token-6hpw8]
Warning FailedMount 59s (x9 over 3m7s) kubelet, node3 MountVolume.SetUp failed for volume "cert" : secret "multicluster-scheduler-agent-cert" not found
Things seem to work fine after this point only after multicluster-scheduler is uninstalled and subsequently re-installed with Helm:
helm uninstall multicluster-scheduler --kube-context cluster2
helm install multicluster-scheduler admiralty/multicluster-scheduler \
--kube-context $CLUSTER2 \
--set agent.enabled=true \
--set agent.clusterName=c2
If one target is misconfigured, e.g., if the referenced kubeconfig secret doesn't exist, the controller manager and proxy scheduler crash (and restart, until configuration is fully ready). This was by design, following the "fail fast, not silently" pattern. It turns out that users would rather have a partially working system—understandably, especially when a misconfigured target is added to a set of properly configured targets.
So we should skip a target if it is misconfigured, but load it as soon as it is properly configured. This means we need to watch for kubeconfig secret changes in the restarter.
Multicluster-scheduler adds finalizers to resources that it doesn't own, for proper garbage collection of remote observation objects. They must be removed when multicluster-scheduler is uninstalled. Otherwise, those objects cannot be deleted.
This should be implemented as a job run as a Helm post-delete hook.
In the meantime, you can run the following to properly clean up your system:
echo 'metadata:
$deleteFromPrimitiveList/finalizers:
- multicluster.admiralty.io/multiclusterForegroundDeletion' > patch.yaml
for kind in service pod persistentvolumeclaim replicationcontroller replicaset statefulset poddisruptionbudget
do
kubectl get $kind --all-namespaces -o go-template='{{range .items}}{{ printf "kubectl patch %s %s -n %s -p \"$(cat patch.yaml)\"\n" .kind .metadata.name .metadata.namespace }}{{ end }}' | sh
done
for kind in node persistentvolume storageclass
do
kubectl get $kind -o go-template='{{range .items}}{{ printf "kubectl patch %s %s -p \"$(cat patch.yaml)\"\n" .kind .metadata.name }}{{ end }}' | sh
done
If a pod is annotated with multicluster.admiralty.io/elect="" in a namespace not labeled with multicluster-scheduler=enabled, it gets scheduled to a local node. The annotation is what some controllers look at to filter proxy pods, so in this case they erroneously consider the normal pod to be a proxy pod. Then, they use the node name as a key to find a client for a target cluster, which doesn't exist...
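The fix the report points to is making the proxy-pod filter consider both signals, not just the annotation. isProxyPod below is an illustrative sketch, not the controllers' actual code:

```go
package main

import "fmt"

// isProxyPod requires both the election annotation on the pod and the opt-in
// label on its namespace. Checking the annotation alone is what caused normal
// pods in non-enabled namespaces to be mistaken for proxy pods.
func isProxyPod(podAnnotations, nsLabels map[string]string) bool {
	_, elected := podAnnotations["multicluster.admiralty.io/elect"]
	return elected && nsLabels["multicluster-scheduler"] == "enabled"
}

func main() {
	ann := map[string]string{"multicluster.admiralty.io/elect": ""}
	fmt.Println(isProxyPod(ann, map[string]string{}))                                    // false: namespace not enabled
	fmt.Println(isProxyPod(ann, map[string]string{"multicluster-scheduler": "enabled"})) // true
}
```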
If a delegate Pod is deleted, e.g., evicted, multicluster-scheduler's agent creates a new one, based on the owning PodDecision in the scheduler cluster. However, if the scheduler cluster is unavailable (common in poor-connectivity use cases), nothing will replace the deleted Pod until connectivity is reestablished.
Therefore, I propose a new custom resource, Delegates: an intermediary between PodDecisions and delegate Pods, colocated with delegate Pods and owning/controlling them.
I'm not able to schedule pods across clusters using v0.8.0.
After deploying v0.8.0 using helm to 3 clusters, I'm seeing the following error in proxy-scheduler and candidate-scheduler:
E0507 13:29:58.249696 1 reflector.go:156] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:108: Failed to list *v1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:serviceaccount:admiralty:multicluster-scheduler" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
Following online guidance, I tried adding the following lines to the multicluster-scheduler ClusterRole:
- apiGroups:
  - storage.k8s.io
  resources:
  - csinodes
  verbs:
  - watch
  - list
  - get
After applying to the cluster, I now see the following error message (again in proxy-scheduler and candidate-scheduler).
E0507 13:25:51.759485 1 reflector.go:156] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:108: Failed to list *v1.CSINode: the server could not find the requested resource
Kubernetes Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-af3caf", GitCommit:"af3caf6136cd355f467083651cc1010a499f59b1", GitTreeState:"clean", BuildDate:"2020-03-27T21:51:36Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
Cloud: AWS (EKS)
Helm chart/version: v0.8.0
Jobs scheduled with the multicluster.admiralty.io/elect: "" annotation never reach completion.
As an example, I set up the multicluster scheduler on two openstack nodes following the instructions in the README, then scheduled the attached sample job taken from the kubernetes docs with the multicluster.admiralty.io/elect: "" annotation. The job should finish in ~10s, but is still running after 12 hours:
[root@node1 ~]# kubectl get pods | grep pi
pi-6hkw9 1/1 Running 0 12h
The proxy pod is rapidly scheduled by its local cluster's scheduler (the default one or the one defined in the spec). This is misleading to the user, because the delegate pod may take longer to start in its target cluster (or stay pending if it's unschedulable).
The pod admission webhook should replace the scheduler name in the proxy pod's spec, so the agent can act as its single-cluster scheduler and only bind it to a node when the corresponding delegate pod is bound to a node in the target cluster.
It appears that all config maps and secrets in all namespaces are "Successfully synced ..." even though nothing was done for most of them. (Only the config maps and secrets referred to by proxy pods are copied to target clusters.)
The message in question is an info log from the processNextWorkItem function of the controller package. If Handle is successful and doesn't require requeuing, the "Successfully synced" message is printed. The following controllers, in particular, enqueue all config maps and secrets and filter them in their respective Handle functions, called from processNextWorkItem. We could filter config maps and secrets in event handlers, so processNextWorkItem is only called for relevant config maps and secrets. We could also just remove the confusing info log (it's not super useful).
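The event-handler filtering suggested above could look something like this sketch. The `referenced` index is hypothetical; in practice the controller would maintain it from proxy pod specs (volumes, env, envFrom).

```go
package main

import "fmt"

// shouldEnqueue filters config maps and secrets at event-handler time:
// only objects referenced by proxy pods are enqueued, so
// processNextWorkItem (and its "Successfully synced" log) only runs for
// objects the controller actually copies to target clusters.
func shouldEnqueue(namespace, name string, referenced map[string]bool) bool {
	return referenced[namespace+"/"+name]
}

func main() {
	referenced := map[string]bool{"default/app-config": true} // hypothetical index
	fmt.Println(shouldEnqueue("default", "app-config", referenced)) // true: copied to targets
	fmt.Println(shouldEnqueue("kube-system", "some-secret", referenced)) // false: ignored
}
```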
Reported by @tghartland, who noted that the agent pod cannot be force deleted either.
The agent pod has a pod observation itself. Therefore, it has a finalizer (for cross-cluster garbage collection). The deployment's strategy is "recreate", meaning there can only be one agent running. When the agent pod is deleted (before a new one can be started according to the strategy), the finalizer isn't removed, because that would have been the agent's responsibility... The "recreate" strategy may not be necessary though, so a new agent could remove the finalizer of its predecessor. If there cannot be two concurrent agents (esp. for virtual-kubelet), we need leader election, except for the finalizer cleaning part... A hacky solution could also be to not observe the agent. But anyway, observations and decisions may not exist in the next version.
Multicluster-scheduler should only add its cross-cluster garbage collection finalizers to secrets and config maps that are copied to other clusters.
At least, it should only add finalizers to those resources in labeled namespaces.
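The proposed scoping could be sketched as a simple predicate. The label key/value follow the project's convention; `copiedByProxyPod` is a hypothetical flag the controller would derive from proxy pod references.

```go
package main

import "fmt"

// shouldAddFinalizer decides whether the cross-cluster garbage collection
// finalizer belongs on a config map or secret: only objects in namespaces
// labeled multicluster-scheduler=enabled that are actually copied to other
// clusters should get it.
func shouldAddFinalizer(nsLabels map[string]string, copiedByProxyPod bool) bool {
	return nsLabels["multicluster-scheduler"] == "enabled" && copiedByProxyPod
}

func main() {
	fmt.Println(shouldAddFinalizer(map[string]string{"multicluster-scheduler": "enabled"}, true)) // true
	fmt.Println(shouldAddFinalizer(map[string]string{}, true))                                    // false: unlabeled namespace
}
```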
Source cluster role references in the source controller are hard-coded as multicluster-scheduler-source and multicluster-scheduler-cluster-summary-viewer, which are their names if the Helm release is named multicluster-scheduler, but may be different otherwise.
We need to pass those names from the Helm chart to the controller, e.g., as environment variables.
The feedback controller relies on the multicluster.admiralty.io/parent-name label value to reconcile remote pod chaperons with proxy pods (because their names differ). However, label values are limited to 63 characters, and pod names can be as long as 253 characters. Therefore, candidate pod chaperon creations, which include the needed label, fail for proxy pods whose names are longer than 63 characters.
We need to use annotations instead of labels for remote controller references.
To select controlled objects given a parent object, we can still use the parent UID label, which is short enough.
Note: We had fixed this before in multicluster-controller, but the feedback controller doesn't use it.
This is a regression introduced in v0.8.2. The patch isn't formatted correctly; it's missing quotes: https://github.com/admiraltyio/admiralty/blame/75cf62603a6f6417bbe2aca561bcaa89298aac7e/cmd/remove-finalizers/main.go#L34
In the meantime, run this instead to remove finalizers (the hook's job, as a script):
echo 'metadata:
  $deleteFromPrimitiveList/finalizers:
  - multicluster.admiralty.io/multiclusterForegroundDeletion' > patch.yaml
for kind in service pod configmap secret
do
kubectl get $kind --all-namespaces -o go-template='{{range .items}}{{ printf "kubectl patch %s %s -n %s -p \"$(cat patch.yaml)\"\n" .kind .metadata.name .metadata.namespace }}{{ end }}' | sh
done
Given a source-target cluster connection, if the target cluster supports multiple huge page sizes (feature gate HugePageStorageMediumSize enabled), but the source cluster doesn't—in particular, if the target cluster runs Kubernetes 1.19 (feature gate enabled by default) and the source cluster runs 1.18 or lower—the cluster summary in the target cluster will include aggregated capacity and allocatable resources for multiple huge page sizes, but when the upstream resource controller tries to apply those values to the virtual node corresponding to the target cluster, kube-apiserver rejects the status update. The virtual node remains with zero capacity and allocatable resources. Therefore, pods can't be scheduled to the target clusters because the virtual node doesn't pass the standard resource filter in the proxy scheduler.
Even though it famously "works on my machine"™!
PR run succeeds: https://github.com/admiraltyio/multicluster-scheduler/actions/runs/146754445
Merge commit fails: https://github.com/admiraltyio/multicluster-scheduler/actions/runs/147689563
No-op commit succeeds: https://github.com/admiraltyio/multicluster-scheduler/actions/runs/147716838
When it fails, it happens immediately after the Argo workflow starts. I suspect a webhook isn't ready?
I added a kubectl cluster-info dump saved as an artifact to better understand what's going on. We should look into it next time the test fails.
Consider leveraging Kubemark.
The proxy pod scheduler's filter plugin creates candidate pods in target clusters, waits for those pods to be reserved or unschedulable, and filters the corresponding virtual node accordingly. If it is not allowed to create a candidate in a target cluster, it knows to filter out the corresponding virtual node. But if it receives any other error, including if the namespace doesn't exist, the entire scheduling cycle fails, whereas only that virtual node should be filtered out.