cybozu-go / coil
CNI plugin for Kubernetes designed for scalability and extensibility
License: Apache License 2.0
Currently, the destinations field of the Egress resource is not editable.
Users need to recreate the Egress resource if they want to modify it, for example, when they want to add another destination.
Allow users to edit Egress destinations.
This means Coil should reconfigure NAT rules in the Pod network namespace and would incur a significant change in coild.
Describe the bug
When applying an example addresspool pool, it just shows the webhook is a 404
Error from server (InternalError): error when creating "0101_addresspool.yaml": Internal error occurred: failed calling webhook "maddresspool.kb.io": the server could not find the requested resource
Environments
To Reproduce
Steps to reproduce the behavior:
apiVersion: coil.cybozu.com/v2
kind: AddressPool
metadata:
  name: default
spec:
  blockSizeBits: 0
  subnets:
  - ipv4: 192.168.0.0/22
Expected behavior
A new addresspool resource should be created so I can use the addressblock resource (I think) within my egress resource, so the pod can run with an external address from the pool. (Otherwise my egress SNAT pod gets a CNI error about "network: failed to allocate address".)
Additional context
my coil.yaml
images:
- name: coil
  newTag: 2.0.9
  newName: ghcr.io/cybozu-go/coil
resources:
- config/default
# If you are using CKE (github.com/cybozu-go/cke) and want to use
# its webhook installation feature, comment the above line and
# uncomment the below line.
#- config/cke
# If you want to enable coil-router, uncomment the following line.
# Note that coil-router can work only for clusters where all the
# nodes are in a flat L2 network.
- config/pod/coil-router.yaml
# If your cluster has enabled PodSecurityPolicy, uncomment the
# following line.
#- config/default/pod_security_policy.yaml
patchesStrategicMerge:
# Uncomment the following if you want to run Coil with Calico network policy.
- config/pod/compat_calico.yaml
# Edit netconf.json to customize CNI configurations
configMapGenerator:
- name: coil-config
  namespace: system
  files:
  - cni_netconf=./netconf.json
# Adds namespace to all resources.
namespace: kube-system
# Labels to add to all resources and selectors.
commonLabels:
  app.kubernetes.io/name: coil
Describe the bug
When updating the status of a BlockRequest fails (in our case due to a flaky API server), this can lead to multiple Blocks being created from a single request. In our case this led to completely filling up all available Pools.
Environments
To Reproduce
Tricky, since it's on the error path.
Expected behavior
A single BlockRequest should always only yield a single Block.
Suggestion
In general, status shouldn't be used as a basis for reconciliation decisions, unless it was computed in the current iteration. Controllers should always look at the actual state instead.
I suggest either deriving the Block name from the request in a deterministic way, or adding a label to the Block with the UID (or name) of the BlockRequest it was created from. This way the controller can simply query the API for the block it would create, and abort if it already exists.
It then no longer relies on a correct and up-to-date status of the BlockRequest and is more resilient to errors.
Define, gather, and export important metrics of Coil v2 for Prometheus.
Describe how to address the issue.
Describe the bug
A cluster IP was not assigned to a pod.
The pod remained in ContainerCreating status.
$ kubectl get pod -n logging ingester-0 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ingester-0 0/1 ContainerCreating 0 5h28m <none> 10.69.1.132 <none> <none>
$ kubectl describe pod -n logging ingester-0
(...)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 2m33s (x78 over 92m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f4036648c160210acd505f6419a0c02b578454ac25762391be76b309a21c8716": failed to allocate address; aborting new block request: context deadline exceeded
A coil-controller logged the following messages.
It seems that the creation of an AddressBlock failed.
cybozu@gcp0-boot-0:~$ stern -n kube-system coil-controller-
(...)
coil-controller-cc56ff6f-sw5rq coil-controller {"level":"error","ts":1617767715.798942,"logger":"blockrequest-reconciler","msg":"internal error","blockrequest":"req-default-10.69.1.132","error":"addressblocks.coil.cybozu.com \"default-7\" already exists"}
coil-controller-cc56ff6f-sw5rq coil-controller {"level":"error","ts":1617767715.7990007,"logger":"controller","msg":"Reconciler error","reconcilerGroup":"coil.cybozu.com","reconcilerKind":"BlockRequest","controller":"blockrequest","name":"req-default-10.69.1.132","namespace":"","error":"addressblocks.coil.cybozu.com \"default-7\" already exists"}
coil-controller-cc56ff6f-sw5rq coil-controller {"level":"error","ts":1617767716.9866533,"logger":"pool-manager","msg":"failed to create AddressBlock","pool":"default","index":7,"node":"10.69.0.4","error":"addressblocks.coil.cybozu.com \"default-7\" already exists"}
coil-controller-cc56ff6f-sw5rq coil-controller {"level":"error","ts":1617767716.9867475,"logger":"blockrequest-reconciler","msg":"internal error","blockrequest":"req-default-10.69.0.4","error":"addressblocks.coil.cybozu.com \"default-7\" already exists"}
coil-controller-cc56ff6f-sw5rq coil-controller {"level":"error","ts":1617767716.9867713,"logger":"controller","msg":"Reconciler error","reconcilerGroup":"coil.cybozu.com","reconcilerKind":"BlockRequest","controller":"blockrequest","name":"req-default-10.69.0.4","namespace":"","error":"addressblocks.coil.cybozu.com \"default-7\" already exists"}
coil-controller-cc56ff6f-sw5rq coil-controller {"level":"error","ts":1617767726.7400553,"logger":"pool-manager","msg":"failed to create AddressBlock","pool":"default","index":7,"node":"10.69.1.132","error":"addressblocks.coil.cybozu.com \"default-7\" already exists"}
coil-controller-cc56ff6f-sw5rq coil-controller {"level":"error","ts":1617767726.7401295,"logger":"blockrequest-reconciler","msg":"internal error","blockrequest":"req-default-10.69.1.132","error":"addressblocks.coil.cybozu.com \"default-7\" already exists"}
coil-controller-cc56ff6f-sw5rq coil-controller {"level":"error","ts":1617767726.7401562,"logger":"controller","msg":"Reconciler error","reconcilerGroup":"coil.cybozu.com","reconcilerKind":"BlockRequest","controller":"blockrequest","name":"req-default-10.69.1.132","namespace":"","error":"addressblocks.coil.cybozu.com \"default-7\" already exists"}
coil-controller-cc56ff6f-sw5rq coil-controller {"level":"error","ts":1617767727.8927827,"logger":"pool-manager","msg":"failed to create AddressBlock","pool":"default","index":7,"node":"10.69.0.4","error":"addressblocks.coil.cybozu.com \"default-7\" already exists"}
coil-controller-cc56ff6f-sw5rq coil-controller {"level":"error","ts":1617767727.892883,"logger":"blockrequest-reconciler","msg":"internal error","blockrequest":"req-default-10.69.0.4","error":"addressblocks.coil.cybozu.com \"default-7\" already exists"}
coil-controller-cc56ff6f-sw5rq coil-controller {"level":"error","ts":1617767727.892939,"logger":"controller","msg":"Reconciler error","reconcilerGroup":"coil.cybozu.com","reconcilerKind":"BlockRequest","controller":"blockrequest","name":"req-default-10.69.0.4","namespace":"","error":"addressblocks.coil.cybozu.com \"default-7\" already exists"}
Environments
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Additional context
Support Kubernetes 1.27.
Previous Pull Request:
Note: controller-runtime v0.15.0 is the largest release in the history of the project.
https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.15.0
Implement Coil v2 features:
Completely rewrite the code under the v2/ directory.
All code from v1 is moved under the v1/ directory.
Describe the bug
Coil-egress accidentally deletes a peer, and a client pod that still holds a FOU tunnel with the NAT pod can't communicate over the NAT because of this.
Environments
To Reproduce
We don't know yet.
Expected behavior
Coil-egress shouldn't delete a peer that is still in use, or should recover the peer if this happens.
Additional context
Add logs to coil-egress so that we can investigate what is going on if this issue occurs.
Describe the bug
I tried to restart the Egress NAT Deployment by kubectl rollout restart.
But the NAT Deployment does not restart.
$ kubectl rollout restart deploy test-nat
$ kubectl get pod -w
NAME READY STATUS RESTARTS AGE
test-nat-5f67c6947d-54rjk 1/1 Running 0 108s
test-nat-5f67c6947d-szxst 1/1 Running 0 108s
test-nat-7f9df9488b-pjvbg 0/1 Pending 0 0s // The new pod is created.
test-nat-7f9df9488b-pjvbg 0/1 Pending 0 0s
test-nat-7f9df9488b-pjvbg 0/1 ContainerCreating 0 0s
test-nat-7f9df9488b-pjvbg 0/1 Terminating 0 0s // The new pod is terminated immediately.
test-nat-7f9df9488b-pjvbg 1/1 Terminating 0 1s
test-nat-7f9df9488b-pjvbg 0/1 Terminating 0 32s
test-nat-7f9df9488b-pjvbg 0/1 Terminating 0 32s
test-nat-7f9df9488b-pjvbg 0/1 Terminating 0 32s
Environments
To Reproduce
kubectl rollout restart deploy <NAT_Deployment>
Expected behavior
The old NAT pods should be terminated, and the new pods should become running.
Additional context
kubectl rollout restart deployment adds a kubectl.kubernetes.io/restartedAt annotation to the pod template of the Deployment to create the new ReplicaSet.
But the egress controller will overwrite the annotation.
https://github.com/cybozu-go/coil/blob/v2.0.14/v2/controllers/egress_controller.go#L118
When we update Coil, we have downtime in coil-egress due to the timing of updating coild and coil-controller.
When we update coil-controller, Deployment resources tied to Egress restart due to the image update.
At this time, the Pods change to the Terminating state.
But when coild hasn't come back up after its own restart, kubelet can't delete the Pods and has to wait for coild to start.
So, even though the container process of the Pod is deleted, Cilium still recognizes that the Pod can handle the existing traffic in this case.
This causes downtime on coil-egress.
We have to control the order of the updates of coild and coil-controller: the first is coild, and the second is coil-controller.
TODO
Describe the bug
When I install an additional CNI, e.g. linkerd-cni, to accomplish more complex network functionality, I get race conditions between the two CNIs.
Either both work, or the other CNI is broken.
Environments
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Both CNI plugins should coexist.
Coil should check if the CNI config got updated and only change the parts which are referenced in the initial ConfigMap.
It should also subscribe to file changes and perform the check accordingly.
It should also ensure it is the first plugin in the list and only update the parts from the initial ConfigMap.
Additional context
For this to be possible, there needs to be a fundamental change in how the CNI configuration gets applied on initialization, so that Coil does not brick other CNIs.
We currently need this to accomplish cross-region encrypted connectivity between clusters. We don't want to use the sidecar approach, so we looked into a CNI approach for a service mesh and discovered this race condition bug.
Describe the bug
Removing addressblocks for a drained node results in an infinite wait time.
We had a node PSU failure which resulted in the sudden (and longer than a few days) downtime. Meanwhile we wanted the NAT service to re-schedule on another node. This failed because the addressblock was never freed by Coil. (as the NAT consists of a /32 public IP, there were no spare addresses).
I expected that when draining a node, all addressblocks assigned to that node would be removed. (Please note: the node in question did not work anymore due to a failed PSU, so the controller evicted all pods with --force and --grace-period=0, because otherwise the draining would hang on the egress NAT deployment.)
Environments
To Reproduce
Steps to reproduce the behavior:
Expected behavior
addressblock gets cleared and egress pod reschedules on another node.
Use the node's InternalIP for the host-side veth IP so that Cilium recognizes a source IP from the node as a localhost address.
Retrieve the node's InternalIP address from the Node resource and set it on the host-side veth.
This code is no longer needed.
This is for v1 migration.
Remove it.
coil/v2/runners/coild_server.go
Lines 257 to 266 in b67d97e
Currently, Coil waits 30 seconds before destroying the pod network in the CNI delete operation. Coil does so to keep connectivity for network components that need time to gracefully shut down active TCP connections. For example, Envoy waits for TCP connections to drain in its preStop hook before shutting down. The CNI delete is called as soon as the pod has the deletion timestamp, and destroying the pod network disrupts connections to Envoy and breaks the graceful shutdown assumption.
However, this implementation forces all pods, including those that don't need such a delay, to wait in the CNI delete operation. For instance, the NAT pods derived from coil Egress, which only receive connection-less UDP packets, have to wait the 30 seconds even though it's not necessary.
EDIT:
We later found out that k8s calls the StopSandbox API of the container runtime after killing container processes. So coil doesn't need to sleep in its delete implementation.
https://github.com/kubernetes/kubernetes/blob/02f9b2240814d2e952eaf7dca3a665a675004f21/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L979
related to #164
Options
Rolling restart of Egress NAT pods causes a brief outage
Cilium selects a new backend if the client hits the same old tuple for syn packets, but it doesn't consider UDP packets. So we need to send a PR for it.
cilium/cilium#20407
This would make it possible to ask questions about implementation specifics of Coil.
It is handy to configure the MTU of veth links automatically.
In coild, look at Device links with up status: https://pkg.go.dev/github.com/vishvananda/netlink#Device
Describe the bug
There are currently only amd64 containers, which will not work on arm64 nodes.
Environments
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Expected it to work on ARM64
Additional context
$ sudo docker inspect ghcr.io/cybozu-go/coil:2.0.13
[
{
"Id": "sha256:60305a706525ca0c5a29d3ffae25755a6e8e0f9fa7996a5f1a2b3fb3a17ae282",
"RepoTags": [
"ghcr.io/cybozu-go/coil:2.0.13"
],
"RepoDigests": [
"ghcr.io/cybozu-go/coil@sha256:8133e128835f7c05f1ca1fd900eaa37fa17640f520f07330fd20a76a01b48dc4"
],
"Parent": "",
"Comment": "",
"Created": "2021-10-26T02:38:28.511583682Z",
"Container": "ded8846ab6f07808c9f3069eb0871f61f234841a68d9f1c36771567337853036",
"ContainerConfig": {
"Hostname": "ded8846ab6f0",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"PATH=/usr/local/coil:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"Cmd": [
"/bin/sh",
"-c",
"#(nop) ",
"ENV PATH=/usr/local/coil:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"Image": "sha256:413e977f9618f6e5e8c36f5b96d3515dee85a274f2479e2934ae25c5c826cbee",
"Volumes": null,
"WorkingDir": "",
"Entrypoint": null,
"OnBuild": null,
"Labels": {
"org.opencontainers.image.source": "https://github.com/cybozu-go/coil"
}
},
"DockerVersion": "20.10.9+azure-1",
"Author": "",
"Config": {
"Hostname": "",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"PATH=/usr/local/coil:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"Cmd": [
"/bin/bash"
],
"Image": "sha256:413e977f9618f6e5e8c36f5b96d3515dee85a274f2479e2934ae25c5c826cbee",
"Volumes": null,
"WorkingDir": "",
"Entrypoint": null,
"OnBuild": null,
"Labels": {
"org.opencontainers.image.source": "https://github.com/cybozu-go/coil"
}
},
"Architecture": "amd64", # THIS RIGHT HERE
"Os": "linux",
"Size": 268482846,
"VirtualSize": 268482846,
"GraphDriver": {
"Data": null,
"Name": "btrfs"
},
"RootFS": {
"Type": "layers",
"Layers": [
"sha256:da55b45d310bb8096103c29ff01038a6d6af74e14e3b67d1cd488c3ab03f5f0d",
"sha256:686277369035b640c16f04679389a897bf76539a025db4663e4993004de3ee36",
"sha256:4717130c466b131a283c53643dcd50f79803c872793fa5b1a0b012262470194a",
"sha256:d4a394709e4d956943dec2da5d0f16160e99a168c20089b3ca6de1ca0f15d48a",
"sha256:bd20f117fab2f9489b962226d5bdf266277fc0cc3c2c816b2b7d887cc28dc805",
"sha256:2d53babe35538bb43e8e7209e7e73a4f700480f4b2e734bd9bacc80cb205a0b8"
]
},
"Metadata": {
"LastTagTime": "0001-01-01T00:00:00Z"
}
}
]
Support Kubernetes 1.23.
Previous Pull Request:
Describe the bug
Unable to delete AddressPool
Environments
To Reproduce
Create an AddressPool and Delete it.
$ kubectl apply -f <manifest file of AddressPool>
$ kubectl delete addresspool <AddressPool Name>
Then coil-controller outputs the following error.
{
"level": "error",
"ts": "2023-12-12T02:04:11Z",
"msg": "Reconciler error",
"controller": "addresspool",
"controllerGroup": "coil.cybozu.com",
"controllerKind": "AddressPool",
"AddressPool": {
"name": "test"
},
"namespace": "",
"name": "test",
"reconcileID": "6f2ae24d-86de-4ec0-b52c-96ded1c22ae6",
"error": "failed to remove finalizer from address pool: addresspools.coil.cybozu.com \"test\" is forbidden: User \"system:serviceaccount:kube-system:coil-controller\" cannot update resource \"addresspools\" in API group \"coil.cybozu.com\" at the cluster scope",
"stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226"
}
Expected behavior
Additional context
This bug could be related to the following issue.
vishvananda/netlink#576
coil: ghcr.io/cybozu-go/coil:2.0.14
Kubernetes Version: 1.21.11
Container Runtime: cri-o 1.21.6
Linux OS: AlmaLinux 4.18.0-348.20.1.el8_5.x86_64
apiVersion: coil.cybozu.com/v2
kind: AddressPool
metadata:
  name: lb
spec:
  blockSizeBits: 6
  subnets:
  - ipv4: 100.126.16.0/20
    ipv6: 2001:7c7:2100:42f:ffff:ffff:ffff:f000/116
Namespace with requirement for DualStack Pods
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    coil.cybozu.com/pool: lb
  name: gitlab-runner
  resourceVersion: "407513963"
  uid: a698674e-6c56-4728-bfdf-952bd2248433
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
Warning FailedCreatePodSandBox 13s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_runner-zaevnzet-project-392-concurrent-0cdzrw_gitlab-runner_abfc57ce-1626-444f-a0f0-635ac3ce9e27_0(372c1a10840209f43cb46fce286f263d0a25da436c6c2157a86517d590f730a4): error adding pod gitlab-runner_runner-zaevnzet-project-392-concurrent-0cdzrw to CNI network "k8s-pod-network": failed to setup pod network; netlink: failed to add a hostIPv4 address: numerical result out of range
Mar 18 22:56:18 wuerfelchen-w-3 kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Mar 18 22:56:18 wuerfelchen-w-3 kernel: IPv6: ADDRCONF(NETDEV_UP): veth88f11a25: link is not ready
Mar 18 22:56:18 wuerfelchen-w-3 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth88f11a25: link becomes ready
Mar 18 22:56:18 wuerfelchen-w-3 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Mar 18 22:56:20 wuerfelchen-w-3 kernel: netlink: 'coild': attribute type 2 has an invalid length.
The code lines which could be the issue are here.
Lines 289 to 312 in 1db9b2d
{"level":"info","ts":1647643830.4085627,"logger":"node-ipam","msg":"requesting a new block","pool":"lb"}
{"level":"info","ts":1647643830.4251256,"logger":"node-ipam","msg":"waiting for request completion","pool":"lb"}
{"level":"info","ts":1647643830.4395173,"logger":"node-ipam","msg":"adding a new block","pool":"lb","name":"lb-0","block-pool":"lb","block-node":"wuerfelchen-w-3"}
{"level":"info","ts":1647643830.4395475,"logger":"node-ipam","msg":"allocated","pool":"lb","block":"lb-0","ipv4":"100.126.16.0","ipv6":"2001:7c7:2100:42f:ffff:ffff:ffff:f000"}
{"level":"info","ts":1647643830.4438093,"logger":"route-exporter","msg":"synchronizing routing table","table-id":119}
{"level":"info","ts":1647643831.7375743,"logger":"node-ipam","msg":"freeing an empty block","pool":"lb","block":"lb-0"}
{"level":"info","ts":1647643831.768794,"logger":"route-exporter","msg":"synchronizing routing table","table-id":119}
{"level":"error","ts":1647643831.7710552,"logger":"grpc","msg":"failed to setup pod network","grpc.start_time":"2022-03-18T22:50:30Z","grpc.request.deadline":"2022-03-18T22:51:30Z","system":"grpc","span.kind":"server","grpc.service":"pkg.cnirpc.CNI","grpc.method":"Add","grpc.request.pod.namespace":"gitlab-runner","grpc.request.netns":"/var/run/netns/fe9df582-47b9-4b13-844d-88165d87ab7f","grpc.request.ifname":"eth0","grpc.request.container_id":"665dbc5ad3ec9ff5d19f55d860f52939694eaaf48c314a80c4ebe84b3b0a10b8","peer.address":"@","grpc.request.pod.name":"debug-pod","error":"netlink: failed to add a hostIPv4 address: numerical result out of range"}
{"level":"error","ts":1647643831.7711568,"logger":"grpc","msg":"finished unary call with code Internal","grpc.start_time":"2022-03-18T22:50:30Z","grpc.request.deadline":"2022-03-18T22:51:30Z","system":"grpc","span.kind":"server","grpc.service":"pkg.cnirpc.CNI","grpc.method":"Add","grpc.request.pod.name":"debug-pod","grpc.request.pod.namespace":"gitlab-runner","grpc.request.netns":"/var/run/netns/fe9df582-47b9-4b13-844d-88165d87ab7f","grpc.request.ifname":"eth0","grpc.request.container_id":"665dbc5ad3ec9ff5d19f55d860f52939694eaaf48c314a80c4ebe84b3b0a10b8","peer.address":"@","error":"rpc error: code = Internal desc = failed to setup pod network","grpc.code":"Internal","grpc.time_ms":1368.211}
{"level":"info","ts":1647643831.7817273,"logger":"grpc","msg":"waiting before destroying pod network","grpc.start_time":"2022-03-18T22:50:31Z","grpc.request.deadline":"2022-03-18T22:51:31Z","system":"grpc","span.kind":"server","grpc.service":"pkg.cnirpc.CNI","grpc.method":"Del","peer.address":"@","grpc.request.pod.name":"debug-pod","grpc.request.pod.namespace":"gitlab-runner","grpc.request.netns":"/var/run/netns/fe9df582-47b9-4b13-844d-88165d87ab7f","grpc.request.ifname":"eth0","grpc.request.container_id":"665dbc5ad3ec9ff5d19f55d860f52939694eaaf48c314a80c4ebe84b3b0a10b8","duration":"30s"}
{"level":"error","ts":1647643861.784251,"logger":"grpc","msg":"intentionally ignoring error for v1 migration","grpc.start_time":"2022-03-18T22:50:31Z","grpc.request.deadline":"2022-03-18T22:51:31Z","system":"grpc","span.kind":"server","grpc.service":"pkg.cnirpc.CNI","grpc.method":"Del","peer.address":"@","grpc.request.pod.name":"debug-pod","grpc.request.pod.namespace":"gitlab-runner","grpc.request.netns":"/var/run/netns/fe9df582-47b9-4b13-844d-88165d87ab7f","grpc.request.ifname":"eth0","grpc.request.container_id":"665dbc5ad3ec9ff5d19f55d860f52939694eaaf48c314a80c4ebe84b3b0a10b8","error":"link not found"}
{"level":"info","ts":1647643861.7843103,"logger":"grpc","msg":"finished unary call with code OK","grpc.start_time":"2022-03-18T22:50:31Z","grpc.request.deadline":"2022-03-18T22:51:31Z","system":"grpc","span.kind":"server","grpc.service":"pkg.cnirpc.CNI","grpc.method":"Del","peer.address":"@","grpc.request.pod.name":"debug-pod","grpc.request.pod.namespace":"gitlab-runner","grpc.request.netns":"/var/run/netns/fe9df582-47b9-4b13-844d-88165d87ab7f","grpc.request.ifname":"eth0","grpc.request.container_id":"665dbc5ad3ec9ff5d19f55d860f52939694eaaf48c314a80c4ebe84b3b0a10b8","grpc.code":"OK","grpc.time_ms":30002.637}
Describe the bug
A Pod failed to start due to the following CNI error:
$ kubectl describe pod ...
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 2m14s (x6051 over 21h) kubelet, 10.69.0.21 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": failed to move veth to host netns: invalid argument
This indicates that coil failed to remove existing veth in some cases.
Note that veth names generated by coil are not random.
They are generated the same way as Calico's in order to work with the Calico NetworkPolicy implementation.
Environments
To Reproduce
Steps to reproduce the behavior:
$ ip l
Additional context
You can get the veth name from the pod name by the following code.
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

func main() {
	h := sha1.New()
	h.Write([]byte(fmt.Sprintf("%s.%s", "topolvm-system", "csi-topolvm-node-tnjfp")))
	fmt.Printf("%s%s", "veth", hex.EncodeToString(h.Sum(nil))[:11])
}
Implement data migration from v1.
The migration should be able to run online.
That is, it must keep the running Pods alive.
It is nice to send back ICMP host unreachable when a client sends packets to a non-existing Pod.
Describe how to address the issue.
Hello, this is not an issue, but I could not find any other way to contact you, therefore I am writing here.
The situation is as follows: I have a cluster already up and running which uses Flannel, and I don't know whether it is possible to install Coil over that setup.
Taking into account that Flannel is an L3 network plugin, I think that simply uninstalling the driver won't work because it would untie the entire cluster network...
Am I right? If the answer is yes, do you have any experience with a migration like this?
Extra info: I found this article https://github.com/kubernetes/kops/blob/master/docs/networking.md#switching-between-networking-providers which is under the Kops repo saying that, in a nutshell, changing between CNI providers is not really recommended.
Thanks guys
We want to avoid reusing an address that was released a short time ago.
Currently, when Coil allocates an IP address from an AddressBlock, it allocates the address with the smallest available index. Therefore, a just-released address can be handed out again immediately.
The AddressBlock resource should record the index last allocated by Coil (lastAllocatedIndex). When Coil allocates a new address to a pod, it should pick the address next to lastAllocatedIndex.
All Egress NAT pods can disappear at the same time when rebooting nodes because currently there's no PDB for them.
Create a PDB for Egress NAT pods.
Describe the bug
Coild doesn't release AddressBlocks which are no longer needed when it starts up.
To Reproduce
Steps to reproduce the behavior:
Run kubectl get addressblock and see that an AddressBlock which is no longer needed is not released.
Expected behavior
Coild releases unused AddressBlocks.
Describe the bug
After creating a Kubernetes cluster with the default service IPs and installing Coil as a CNI (no other CNIs), pods are not able to communicate directly.
Environments
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:25:17Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:19:12Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}
Linux HostnameHere 5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Being able to access other pods in same (or other) namespace.
Additional context
It doesn't matter if both pods are scheduled on the same node, traceroute makes it seem like traffic cannot be delivered to the pod.
Example traceroute (simplified):
When curling the service's ClusterIP from the node itself, or even from another node, everything works as expected.
Is this a misconfiguration?
Describe the bug
I cannot seem to build Coil v2 with kustomize following the setup guide in docs/setup.md.
Environments
kustomize v4.1.3
and k8s 1.21.1
To Reproduce
$ git clone https://github.com/cybozu-go/coil
<SNIP>
$ cd coil/v2
$ make certs
go run ./cmd/gencert -outdir=/root/coil/coil2/v2/config/default
$ # kustomization.yaml needed no change
$ # netconf.json changed just like in setup.md but with MTU 9000
$ kustomize build . > coil.yaml
Error: field specified in var '{CACERT ~G_v1_Secret {data[ca.crt]}}' not found in corresponding resource
Expected behavior
kustomize builds coil.yaml successfully.
An issue in kube-proxy with IPVS mode breaks some CNI plugins. Coil is no exception.
kubernetes/kubernetes#71555
An obvious workaround is to run kube-proxy in iptables mode.
Coil should support the latest three Kubernetes versions.
As of Aug. 2019, it should support Kubernetes 1.13, 1.14, and 1.15.
Run end-to-end test on all these versions.
Describe how to address the issue.
Currently, coil-egress uses the fixed encap-sport 5555, and it causes some issues.
Investigate the tunnel collect metadata mode and use it to lift the limitation described above.
Describe the bug
I had a coredns pod that was being scheduled on a master node over and over, which failed due to CNI version incompatibility. Each restart resulted in a new addressblock reservation. Address blocks did not clear after each failed attempt, and the full /16 is now used up, resulting in pods stuck in the creating phase.
Environments
Expected behavior
Blocks should be cleared on pod finalization, even on master nodes.
Support Kubernetes 1.22
Coil already supports k8s 1.20, but it depends on controller-runtime 0.6.3.
The node-role.kubernetes.io/master label is deprecated in k8s 1.20 and node-role.kubernetes.io/control-plane is introduced.
Update controller-runtime to 0.8.3 and other dependencies.
Add the node-role.kubernetes.io/control-plane label to tolerations.
Describe the bug
When I describe Node objects, I expect the correct podCIDR to be shown, for easier debugging.
Environments
To Reproduce
Steps to reproduce the behavior:
kubectl get nodes
podCIDR
podCIDRs
Expected behavior
If coild assigns podCIDR it should also update the node object accordingly.
Additional context
kubectl describe node
PodCIDR: 172.30.8.0/24
PodCIDRs: 172.30.8.0/24
kubectl get addressblocks
default-4 <nodename> default 172.30.1.0/26
The number of clients for a coil-egress Pod can be a useful metric for HPA.
Add a gauge like this:
coil_egress_clients{namespace="foo",egress="nat"} 30
Describe the bug
Due to a bug in kube-apiserver, even though there are still AddressBlocks carved from an AddressPool, the AddressPool can be deleted despite the AddressBlocks having ownerReferences with blockOwnerDeletion=true.
kubernetes/kubernetes#86509 (comment)
So, AddressPool should have a finalizer until all of its child AddressBlocks get deleted.
Environments
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Additional context
CNI spec v1.0 has been released.
https://github.com/containernetworking/cni/blob/v1.0.1/Documentation/spec-upgrades.md
Check breaking changes and new features and take necessary actions.
It seems that module github.com/cybozu-go/coil/v2 no longer depends on github.com/dgrijalva/jwt-go, either directly or indirectly.
So the replace directive left in go.mod makes no sense. Should it be dropped?
$ go mod why -m github.com/golang-jwt/jwt/v4
# github.com/golang-jwt/jwt/v4
(main module does not need module github.com/golang-jwt/jwt/v4)
https://github.com/cybozu-go/coil/blob/main/v2/go.mod#L5
replace github.com/dgrijalva/jwt-go => github.com/golang-jwt/jwt/v4 v4.4.2
Design the new representation of configurations and states of Coil v2.
They should be represented as CustomResourceDefinition (CRD) or ConfigMaps.
Write documents about them.
I set up the coil NAT gateway on a kind-created cluster following the README commands.
$ cd v2
$ make certs
$ make image
$ cd e2e
$ make start
$ make install-coil
but I encountered some CNI issues.
Some pods are stuck in Pending and CoreDNS is stuck in ContainerCreating.
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coil-controller-866b8fd666-5b5l4 0/1 Pending 0 14m <none> <none> <none> <none>
kube-system coil-controller-866b8fd666-f5zmw 1/1 Running 0 14m 172.18.0.5 coil-control-plane <none> <none>
kube-system coil-router-6w6sp 1/1 Running 0 14m 172.18.0.5 coil-control-plane <none> <none>
kube-system coil-router-fp7rb 1/1 Running 0 14m 172.18.0.6 coil-worker2 <none> <none>
kube-system coil-router-xkvzl 1/1 Running 0 14m 172.18.0.7 coil-worker <none> <none>
kube-system coild-d7fh7 1/1 Running 0 14m 172.18.0.7 coil-worker <none> <none>
kube-system coild-j66hh 1/1 Running 0 14m 172.18.0.6 coil-worker2 <none> <none>
kube-system coild-nmsnv 1/1 Running 0 14m 172.18.0.5 coil-control-plane <none> <none>
kube-system coredns-bd6b6df9f-fmgd6 0/1 ContainerCreating 0 17m <none> coil-control-plane <none> <none>
kube-system coredns-bd6b6df9f-qqd47 0/1 ContainerCreating 0 17m <none> coil-control-plane <none> <none>
kube-system etcd-coil-control-plane 1/1 Running 0 17m 172.18.0.5 coil-control-plane <none> <none>
kube-system kube-apiserver-coil-control-plane 1/1 Running 0 17m 172.18.0.5 coil-control-plane <none> <none>
kube-system kube-controller-manager-coil-control-plane 1/1 Running 0 17m 172.18.0.5 coil-control-plane <none> <none>
kube-system kube-proxy-4rdh7 1/1 Running 0 16m 172.18.0.6 coil-worker2 <none> <none>
kube-system kube-proxy-j4b48 1/1 Running 0 17m 172.18.0.5 coil-control-plane <none> <none>
kube-system kube-proxy-zwnnd 1/1 Running 0 17m 172.18.0.7 coil-worker <none> <none>
kube-system kube-scheduler-coil-control-plane 1/1 Running 0 17m 172.18.0.5 coil-control-plane <none> <none>
local-path-storage local-path-provisioner-6fd4f85bbc-vbv58 0/1 ContainerCreating 0 17m <none> coil-control-plane <none> <none>
When I describe the pods:
kubectl describe pods coredns-bd6b6df9f-fmgd6 -n kube-system
it shows:
failed (add): failed to allocate address; aborting new block request: context deadline exceeded
Warning FailedCreatePodSandBox 50s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6516770d68bb935f8746c5a0ef17636d028ecd932ba0dbe2fd686a63e19ff935": plugin type="coil" failed (add): failed to allocate address; aborting new block request: context deadline exceeded
Warning FailedCreatePodSandBox 20s kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3d0c988e6c39ff1183964d6adc34105998919be5ec61af6c1214526781fc6d52": plugin type="coil" failed (add): failed to allocate address; aborting new block request: context deadline exceeded
Could you share some advice on how to troubleshoot this issue?
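One plausible first check: "failed to allocate address" can occur when no AddressPool named default exists for coild to carve blocks from, so verifying that such a pool is present is a reasonable starting point. A minimal pool might look like the sketch below (the subnet and blockSizeBits values are illustrative only).

```yaml
# Minimal default AddressPool sketch; coild allocates address blocks to
# nodes from the pool named "default". Values here are illustrative.
apiVersion: coil.cybozu.com/v2
kind: AddressPool
metadata:
  name: default
spec:
  blockSizeBits: 5
  subnets:
    - ipv4: 10.100.0.0/16
```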
Coil should support the new CNI spec 0.4.0
https://github.com/containernetworking/cni/releases/tag/v0.7.0
Describe how to address the issue.
Add support for Kubernetes 1.19.
Since coil is built with kubebuilder and controller-runtime, and those projects have not yet adopted the client-go of k8s 1.19, we cannot upgrade the client libraries to k8s 1.19 yet.
Add k8s 1.19 to the matrix of CI and fix problems if any.
Upgrade the container base image from Ubuntu 18.04 to Ubuntu 20.04.
Use Go 1.15 to build binaries.
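The CI matrix addition might look like the excerpt below, assuming a GitHub Actions workflow; the file path, job layout, and existing version list are all assumptions.

```yaml
# .github/workflows/ci.yaml (hypothetical excerpt)
strategy:
  matrix:
    kubernetes-version: ["1.17", "1.18", "1.19"]
```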
Describe the bug
Egress traffic is disconnected for about 30 seconds when an Egress Pod is deleted.
Environments
To Reproduce
Steps to reproduce the behavior:
1. Create an Egress whose .spec.replicas is 2 or more.
2. Run ping to the external network via the Egress NAT from the client pod.
3. Delete an Egress Pod; sometimes ping will fail (packets lost) for about 30 seconds.
Expected behavior
There are multiple replicas, so traffic should fail over to another Egress Pod immediately.