Comments (29)
For policymap pressure, there's a dedicated troubleshooting guide: https://docs.cilium.io/en/stable/operations/troubleshooting/#policymap-pressure-and-overflow.
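For intuition, the pressure metric described in that guide is essentially a fill ratio of the per-endpoint policy BPF map. A minimal sketch (names are illustrative, not Cilium's actual code):

```python
# Hedged sketch: "map pressure" as a fill ratio of a BPF map
# (entries in use / max entries). Function and variable names are
# illustrative, not taken from Cilium's implementation.

def map_pressure(used_entries: int, max_entries: int) -> float:
    """Return the fill ratio of a BPF map in [0.0, 1.0+]."""
    if max_entries <= 0:
        raise ValueError("max_entries must be positive")
    return used_entries / max_entries

# Cilium's default policymap size (bpf-policy-map-max) is 16384 entries.
# A selector holding ~5.5k identities costs roughly one entry per
# identity in each affected endpoint's policymap:
pressure = map_pressure(5500, 16384)
print(f"{pressure:.2f}")  # prints 0.34
```

This is why a selector matching every identity in the cluster pushes the pressure metric up on exactly the nodes where that policy applies.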
from cilium.
Hi! Please attach a sysdump collected from the affected node. Also, could you collect pprof traces while you are experiencing the high CPU usage?
I managed to capture a 300s pprof while the issue was happening (it was a pretty mild case):
During the incident (300s):
After the incident (300s):
Do you have any CPU limits applied on the Cilium daemonset that could be causing K8s to pin Cilium to one CPU?
I would expect these messages and the spike of CPU to occur at Cilium startup time. Is it persistent?
Do you have any CPU limits applied on the Cilium daemonset that could be causing K8s to pin Cilium to one CPU?
Only requests, as in your chart (and cilium-agent is hammering a single core when the issue occurs):
$ k get ds -n kube-system cilium -o yaml | grep resources: -B1 -A 5
Warning: short name "ds" could also match lower priority resource daemonsets.tap.linkerd.io
timeoutSeconds: 5
resources: {}
securityContext:
capabilities:
add:
- CHOWN
- KILL
--
name: config
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /tmp
name: tmp
--
name: mount-cgroup
resources: {}
securityContext:
capabilities:
add:
- SYS_ADMIN
- SYS_CHROOT
--
name: apply-sysctl-overwrites
resources: {}
securityContext:
capabilities:
add:
- SYS_ADMIN
- SYS_CHROOT
--
name: mount-bpf-fs
resources: {}
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
--
name: clean-cilium-state
resources:
requests:
cpu: 100m
memory: 100Mi
securityContext:
capabilities:
--
name: cleanup-kube-proxy-iptables
resources: {}
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
- command:
--
name: install-cni-binaries
resources:
requests:
cpu: 100m
memory: 10Mi
securityContext:
capabilities:
I would expect these messages and the spike of CPU to occur at Cilium startup time. Is it persistent?
It happens randomly. And it's the other way around: restarting the Cilium pod solves the issue. It also stops on its own after some time. I couldn't spot any pattern; sometimes it doesn't happen at all, sometimes it happens multiple times a day.
Check this pod for example (from a different cluster). The issue started there around 2h ago, and the pod is 8 days old, so it's definitely not happening during startup.
$ k get pod -n kube-system cilium-rrl5g
NAME READY STATUS RESTARTS AGE
cilium-rrl5g 1/1 Running 0 8d
Some logs from this pod:
cilium-rrl5g cilium-agent 2024-03-27T11:57:04.793372471Z level=info msg="hubble events queue is processing messages again: 2786 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:06.590436158Z level=info msg="hubble events queue is processing messages again: 953 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:06.667555210Z level=info msg="hubble events queue is processing messages again: 30 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:33.669955939Z level=warning msg="hubble events queue is full: dropping messages; consider increasing the queue size (hubble-event-queue-size) or provisioning more CPU" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:35.577782191Z level=info msg="hubble events queue is processing messages again: 1149 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:35.820730937Z level=warning msg="hubble events queue is full: dropping messages; consider increasing the queue size (hubble-event-queue-size) or provisioning more CPU" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:37.523875441Z level=info msg="hubble events queue is processing messages again: 1197 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:37.531121740Z level=warning msg="hubble events queue is full: dropping messages; consider increasing the queue size (hubble-event-queue-size) or provisioning more CPU" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:37.574087467Z level=info msg="hubble events queue is processing messages again: 57 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:43.438284773Z level=info msg="hubble events queue is processing messages again: 3678 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:43.498583245Z level=info msg="hubble events queue is processing messages again: 12 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:43.709256948Z level=warning msg="hubble events queue is full: dropping messages; consider increasing the queue size (hubble-event-queue-size) or provisioning more CPU" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:45.466733873Z level=info msg="FQDN garbage collector work deleted 1 name entries: loki.g8s.________." controller=dns-garbage-collector-job subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T11:57:45.466959529Z level=error msg="Timed out waiting for datapath updates of FQDN IP information; returning response" subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T11:57:45.467514164Z level=warning msg="Lock acquisition time took longer than expected. Potentially too many parallel DNS requests being processed, consider adjusting --dnsproxy-lock-count and/or --dnsproxy-lock-timeout" dnsName=loki.g8s.________. duration=1.685595826s expected=500ms subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T11:57:45.522924632Z level=info msg="hubble events queue is processing messages again: 1369 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:47.512835127Z level=error msg="Timed out waiting for datapath updates of FQDN IP information; returning response" subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T11:57:49.549721810Z level=info msg="hubble events queue is processing messages again: 2714 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:55.538976765Z level=info msg="hubble events queue is processing messages again: 3699 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:55.611009249Z level=warning msg="hubble events queue is full: dropping messages; consider increasing the queue size (hubble-event-queue-size) or provisioning more CPU" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:57.206446178Z level=error msg="Timed out waiting for datapath updates of FQDN IP information; returning response" subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T11:57:59.247446557Z level=info msg="hubble events queue is processing messages again: 2616 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:58:00.753079986Z level=info msg="hubble events queue is processing messages again: 1315 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:58:00.755955965Z level=info msg="hubble events queue is processing messages again: 4 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:59:45.468853757Z level=info msg="FQDN garbage collector work deleted 1 name entries: loki.g8s.________." controller=dns-garbage-collector-job subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T12:00:38.889759093Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:00:45.470007593Z level=info msg="FQDN garbage collector work deleted 1 name entries: loki.g8s.________." controller=dns-garbage-collector-job subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T12:00:48.987167613Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:00:49.988317445Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:00:51.575339895Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:00:52.594914261Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:00:55.649573618Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:00.566374704Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:01.566488607Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:02.566822237Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:03.566834877Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:05.308364028Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:06.309240293Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:14.592780031Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:15.592987778Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
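The warnings in these logs name specific agent tunables (hubble-event-queue-size, dnsproxy-lock-count, dnsproxy-lock-timeout). Assuming they are set via the cilium-config ConfigMap, raising them might look like the following; the values are examples only, not recommendations:

```yaml
# Illustrative fragment of the cilium-config ConfigMap; the keys match
# the flags named in the log messages, the values are examples only.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  hubble-event-queue-size: "16384"  # larger buffer before Hubble drops events
  dnsproxy-lock-count: "257"        # more lock shards -> less DNS proxy contention
  dnsproxy-lock-timeout: "500ms"
```

Note that these only relieve symptoms (dropped Hubble events, slow DNS proxy lock acquisition); they don't address whatever is saturating the CPU in the first place.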
Unfortunately I couldn't capture a pprof, because pprof isn't enabled on this cluster, and since restarting the pod fixes the issue, I can't enable it without clearing the problem first.
So the fact that we are seeing
level=info msg="hubble events queue is processing messages again..."
means this must be an older Cilium version. That log message has been at debug level since v1.14.0 (cf. #22865).
Which version of Cilium are those most recent logs from? I would suggest upgrading to the latest Cilium version.
Yes, this is 1.13; I noted that in the description, but I'll copy the information below for convenience. In my synthetic tests I noticed some performance improvements in 1.14, but I couldn't reproduce the issue we have on the "real" clusters.
Was there something in 1.14+ that should improve performance, especially when it comes to processing CIDR-based network policies?
Cilium Version
$ cilium version
cilium-cli: v0.14.5 compiled with go1.20.4 on linux/amd64
cilium image (default): v1.13.2
cilium image (stable): v1.15.1
cilium image (running): v1.13.6
Kernel Version
$ uname -a
Linux ip-10-150-21-29.eu-west-1.compute.internal 6.1.43-flatcar #1 SMP PREEMPT_DYNAMIC Mon Aug 7 22:10:45 -00 2023 x86_64 AMD EPYC 7571 AuthenticAMD GNU/Linux
Kubernetes Version
$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.24.17
BTW, the upgrade to 1.14 is on the schedule, but we'd like to know what is causing this, or what was improved in 1.14+ in this regard. Do you have any idea?
@christarazi do you know anything improved in 1.14+ specific to this issue?
I would suggest reading the changelog. I don't remember any specific changes but I know that there were many changes that went into improving this area.
We updated this cluster's Cilium to 1.15.1:
$ k exec -it -n kube-system cilium-nsvzl -- cilium version
Defaulted container "cilium-agent" out of: cilium-agent, config (init), mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init), cleanup-kube-proxy-iptables (init), install-cni-binaries (init)
Client: 1.15.1 a368c8f0 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64
Daemon: 1.15.1 a368c8f0 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64
Now we're noticing that one of the Cilium agents on this cluster has higher cilium_policy_* map consumption.
Logs are also full of:
cilium-agent level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
Does that ring any bell?
The msg="regenerating all endpoints" reason="one or more identities created or deleted" log is probably not the right thing to chase, because it looks similar on all nodes.
Any idea why one agent would have significantly higher cilium_policy_* pressure than the others? This node doesn't even run more pods than other nodes in the cluster.
Thanks, can you tell me what USERS means in the output of cilium policy selectors?
I noticed this (empty?) selector in the output: &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{},} and it holds 5.5k identities. Is this normal?
That selector says that it's selecting all identities in the cluster. Empty selector = all. Which would explain why there's policymap pressure, as selecting all identities would mean a policymap entry per identity.
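For context on where such an empty selector can come from: a rule with an empty peer selector compiles down to one. A hypothetical policy that would produce it (names and labels are illustrative):

```yaml
# Hypothetical CiliumNetworkPolicy: the empty fromEndpoints selector ({})
# matches every identity in the cluster, which costs roughly one
# policymap entry per identity on each endpoint the policy applies to.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-from-anywhere   # illustrative name
spec:
  endpointSelector:
    matchLabels:
      app: example            # illustrative label
  ingress:
    - fromEndpoints:
        - {}                  # empty selector = all identities
```

Broad "allow all" rules like this are often cheaper to express as entity-based rules (e.g. fromEntities) rather than an identity-enumerating selector, but the exact trade-off depends on the policy.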
Thanks @christarazi. Is there a straightforward way to tell which policy this selector is generated for?
Also, can you tell me what USERS means in the output of cilium policy selectors?
Thanks @christarazi. Is there a straightforward way to tell which policy this selector is generated for?
That command already tells you which policy the selector comes from, under the LABELS column.
Also, can you tell me what USERS means in the output of cilium policy selectors?
USERS is fairly low-level information, but IIRC it tracks the number of rules that the selector "uses" or applies to.
Hmm, the LABELS column is completely empty in my case...
k exec -it -n kube-system cilium-s9jtf -c cilium-agent -- cilium policy selectors -v | head -n 4
SELECTOR LABELS USERS IDENTITIES
&LabelSelector{MatchLabels:map[string]string{any.app.kubernetes.io/name: loki,k8s.io.kubernetes.pod.namespace: kube-system,},MatchExpressions:[]LabelSelectorRequirement{},} 6
&LabelSelector{MatchLabels:map[string]string{any.k8s-app: coredns,k8s.io.kubernetes.pod.namespace: kube-system,},MatchExpressions:[]LabelSelectorRequirement{},} 6 17858
32450
You might be running a version where that part of the command is broken; the most recent version works fine for me.
It's 1.15.1:
$ k exec -it -n kube-system cilium-s9jtf -c cilium-agent -- cilium version
Client: 1.15.1 a368c8f0 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64
Daemon: 1.15.1 a368c8f0 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64
3515 &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{LabelSelectorRequirement{Key:k8s.io.kubernetes.pod.namespace,Operator:Exists,Values:[],},},}
3515 &LabelSelector{MatchLabels:map[string]string{k8s.io.cilium.k8s.policy.cluster: default,},MatchExpressions:[]LabelSelectorRequirement{},}
3553 &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{},}
Do you know what &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{LabelSelectorRequirement{Key:k8s.io.kubernetes.pod.namespace,Operator:Exists,Values:[],},},} or &LabelSelector{MatchLabels:map[string]string{k8s.io.cilium.k8s.policy.cluster: default,},MatchExpressions:[]LabelSelectorRequirement{},} are?
3515 &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{LabelSelectorRequirement{Key:k8s.io.kubernetes.pod.namespace,Operator:Exists,Values:[],},},} 3515 &LabelSelector{MatchLabels:map[string]string{k8s.io.cilium.k8s.policy.cluster: default,},MatchExpressions:[]LabelSelectorRequirement{},} 3553 &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{},}
Do you know what are
&LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{LabelSelectorRequirement{Key:k8s.io.kubernetes.pod.namespace,Operator:Exists,Values:[],},},}
or &LabelSelector{MatchLabels:map[string]string{k8s.io.cilium.k8s.policy.cluster: default,},MatchExpressions:[]LabelSelectorRequirement{},}?
The first looks like a selector that selects anything with a namespace label. The second looks like a selector that selects anything inside the default cluster. I would guess both select all identities, because every pod has a namespace label and every pod carries the default cluster label.
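For illustration, a rule that compiles to the first kind of selector might look like the following; the policy name is hypothetical, and only the matchExpressions shape is the point:

```yaml
# Hypothetical CiliumClusterwideNetworkPolicy whose peer rule compiles to
# a "namespace Exists" selector: it matches any peer carrying a namespace
# label, i.e. every pod in the cluster (but not world/host entities).
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-all-pods        # illustrative name
spec:
  endpointSelector: {}
  ingress:
    - fromEndpoints:
        - matchExpressions:
            - key: io.kubernetes.pod.namespace
              operator: Exists
```

Rules of this shape explain the pattern in the output above: the selector's identity count tracks the total number of pod identities, so the associated policymaps grow with the cluster rather than with the node's own workload.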
This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.
Seems like the issue went away after upgrading to 1.15.