Comments (29)

christarazi commented on September 26, 2024

For policymap pressure and overflow, there's a specific troubleshooting guide: https://docs.cilium.io/en/stable/operations/troubleshooting/#policymap-pressure-and-overflow.
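
For reference, a minimal way to spot-check this on a node (a sketch; the pod name is just an example, and IIRC the pressure gauge only appears once a map crosses an internal utilization threshold):

$ kubectl exec -n kube-system cilium-bzfnk -c cilium-agent -- \
    cilium metrics list | grep bpf_map_pressure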

aditighag commented on September 26, 2024

Hi! Please attach a sysdump collected from the affected node. Also, could you collect pprof traces while you are experiencing the high CPU usage?
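
(For reference, one way to grab a CPU profile, assuming pprof has been enabled on the agent and is listening on its default port 6060; the pod name and output file are examples:)

$ kubectl port-forward -n kube-system pod/cilium-bzfnk 6060:6060 &
$ curl -o cpu.pprof "http://localhost:6060/debug/pprof/profile?seconds=300"
$ go tool pprof -http=:8080 cpu.pprof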

kopiczko commented on September 26, 2024

I managed to capture pprof profiles while the issue was happening (it was a pretty mild case):

During the incident (300s):

[pprof CPU profile: pprof-cpu-cilium-bzfnk-during]

After the incident (300s):

[pprof CPU profile: pprof-cpu-cilium-bzfnk-after]

christarazi commented on September 26, 2024

Do you have any CPU limits applied on the Cilium daemonset that could be causing K8s to pin Cilium to one CPU?

christarazi commented on September 26, 2024

I would expect these messages and the spike of CPU to occur at Cilium startup time. Is it persistent?

kopiczko commented on September 26, 2024

Do you have any CPU limits applied on the Cilium daemonset that could be causing K8s to pin Cilium to one CPU?

Only requests, as in your chart (and cilium-agent is hammering a single core when the issue occurs):

$ k get ds -n kube-system cilium -o yaml | grep resources: -B1 -A 5
Warning: short name "ds" could also match lower priority resource daemonsets.tap.linkerd.io
          timeoutSeconds: 5
        resources: {}
        securityContext:
          capabilities:
            add:
            - CHOWN
            - KILL
--
        name: config
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /tmp
          name: tmp
--
        name: mount-cgroup
        resources: {}
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
            - SYS_CHROOT
--
        name: apply-sysctl-overwrites
        resources: {}
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
            - SYS_CHROOT
--
        name: mount-bpf-fs
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
--
        name: clean-cilium-state
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          capabilities:
--
        name: cleanup-kube-proxy-iptables
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - command:
--
        name: install-cni-binaries
        resources:
          requests:
            cpu: 100m
            memory: 10Mi
        securityContext:
          capabilities:

kopiczko commented on September 26, 2024

I would expect these messages and the spike of CPU to occur at Cilium startup time. Is it persistent?

It happens randomly, and it's the other way around: restarting the cilium pod solves the issue. It also stops on its own after some time. I couldn't spot any pattern; sometimes it doesn't happen at all, sometimes it happens multiple times a day.

Take this pod for example (from a different cluster). The issue started there around 2h ago, and the pod is 8 days old, so it's definitely not happening during startup.

$ k get pod -n kube-system cilium-rrl5g                 
NAME           READY   STATUS    RESTARTS   AGE
cilium-rrl5g   1/1     Running   0          8d

[screenshot]

kopiczko commented on September 26, 2024

Some logs from this pod:

cilium-rrl5g cilium-agent 2024-03-27T11:57:04.793372471Z level=info msg="hubble events queue is processing messages again: 2786 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:06.590436158Z level=info msg="hubble events queue is processing messages again: 953 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:06.667555210Z level=info msg="hubble events queue is processing messages again: 30 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:33.669955939Z level=warning msg="hubble events queue is full: dropping messages; consider increasing the queue size (hubble-event-queue-size) or provisioning more CPU" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:35.577782191Z level=info msg="hubble events queue is processing messages again: 1149 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:35.820730937Z level=warning msg="hubble events queue is full: dropping messages; consider increasing the queue size (hubble-event-queue-size) or provisioning more CPU" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:37.523875441Z level=info msg="hubble events queue is processing messages again: 1197 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:37.531121740Z level=warning msg="hubble events queue is full: dropping messages; consider increasing the queue size (hubble-event-queue-size) or provisioning more CPU" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:37.574087467Z level=info msg="hubble events queue is processing messages again: 57 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:43.438284773Z level=info msg="hubble events queue is processing messages again: 3678 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:43.498583245Z level=info msg="hubble events queue is processing messages again: 12 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:43.709256948Z level=warning msg="hubble events queue is full: dropping messages; consider increasing the queue size (hubble-event-queue-size) or provisioning more CPU" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:45.466733873Z level=info msg="FQDN garbage collector work deleted 1 name entries: loki.g8s.________." controller=dns-garbage-collector-job subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T11:57:45.466959529Z level=error msg="Timed out waiting for datapath updates of FQDN IP information; returning response" subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T11:57:45.467514164Z level=warning msg="Lock acquisition time took longer than expected. Potentially too many parallel DNS requests being processed, consider adjusting --dnsproxy-lock-count and/or --dnsproxy-lock-timeout" dnsName=loki.g8s.________. duration=1.685595826s expected=500ms subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T11:57:45.522924632Z level=info msg="hubble events queue is processing messages again: 1369 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:47.512835127Z level=error msg="Timed out waiting for datapath updates of FQDN IP information; returning response" subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T11:57:49.549721810Z level=info msg="hubble events queue is processing messages again: 2714 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:55.538976765Z level=info msg="hubble events queue is processing messages again: 3699 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:55.611009249Z level=warning msg="hubble events queue is full: dropping messages; consider increasing the queue size (hubble-event-queue-size) or provisioning more CPU" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:57:57.206446178Z level=error msg="Timed out waiting for datapath updates of FQDN IP information; returning response" subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T11:57:59.247446557Z level=info msg="hubble events queue is processing messages again: 2616 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:58:00.753079986Z level=info msg="hubble events queue is processing messages again: 1315 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:58:00.755955965Z level=info msg="hubble events queue is processing messages again: 4 messages were lost" subsys=hubble
cilium-rrl5g cilium-agent 2024-03-27T11:59:45.468853757Z level=info msg="FQDN garbage collector work deleted 1 name entries: loki.g8s.________." controller=dns-garbage-collector-job subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T12:00:38.889759093Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:00:45.470007593Z level=info msg="FQDN garbage collector work deleted 1 name entries: loki.g8s.________." controller=dns-garbage-collector-job subsys=daemon
cilium-rrl5g cilium-agent 2024-03-27T12:00:48.987167613Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:00:49.988317445Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:00:51.575339895Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:00:52.594914261Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:00:55.649573618Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:00.566374704Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:01.566488607Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:02.566822237Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:03.566834877Z level=info msg="regenerating all endpoints" reason= subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:05.308364028Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:06.309240293Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:14.592780031Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
cilium-rrl5g cilium-agent 2024-03-27T12:01:15.592987778Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
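
(As an aside, the queue-size knob those warnings refer to is the hubble-event-queue-size agent option. A sketch of raising it via the cilium-config ConfigMap; this only treats the symptom of the dropped Hubble events, not the CPU saturation itself, and the agents need a restart to pick it up:)

$ kubectl patch configmap -n kube-system cilium-config \
    --type merge -p '{"data":{"hubble-event-queue-size":"16384"}}'
$ kubectl rollout restart daemonset/cilium -n kube-system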

kopiczko commented on September 26, 2024

Unfortunately I couldn't capture a pprof profile because we don't have pprof enabled on this cluster, and restarting the pod fixes the issue, so I can't enable it without making the problem go away.
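
(For future incidents, a sketch of enabling the agent's pprof endpoint via the Helm chart; pprof.enabled is assumed to be the right value name for the chart version in use, and flipping it rolls the agents, which clears the current occurrence anyway:)

$ helm upgrade cilium cilium/cilium -n kube-system \
    --reuse-values \
    --set pprof.enabled=true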

christarazi commented on September 26, 2024

So the fact that we are seeing

level=info msg="hubble events queue is processing messages again..."

means this must be an older Cilium version. That log message has been at debug level since v1.14.0 (cf. #22865).

Which version of Cilium are those most recent logs from? I would suggest upgrading to the latest Cilium version.

kopiczko commented on September 26, 2024

Yes, this is 1.13; I noted that in the description, but I'll copy the information below for convenience. In my synthetic tests I noticed some performance improvements in 1.14, but I couldn't reproduce the issue we see on the "real" clusters.

Was there something in 1.14+ that should improve performance, especially when it comes to processing CIDR-based network policies?


Cilium Version

$ cilium version 
cilium-cli: v0.14.5 compiled with go1.20.4 on linux/amd64
cilium image (default): v1.13.2
cilium image (stable): v1.15.1
cilium image (running): v1.13.6

Kernel Version

$ uname -a
Linux ip-10-150-21-29.eu-west-1.compute.internal 6.1.43-flatcar #1 SMP PREEMPT_DYNAMIC Mon Aug  7 22:10:45 -00 2023 x86_64 AMD EPYC 7571 AuthenticAMD GNU/Linux

Kubernetes Version

$ kubectl version                                                                                   
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.24.17

kopiczko commented on September 26, 2024

BTW, an upgrade to 1.14 is already scheduled, but we'd like to know what is causing this, or what was improved in 1.14+ in this regard. Do you have any idea?

kopiczko commented on September 26, 2024

@christarazi do you know of anything improved in 1.14+ that is specific to this issue?

christarazi commented on September 26, 2024

I would suggest reading the changelog. I don't remember any specific changes but I know that there were many changes that went into improving this area.

kopiczko commented on September 26, 2024

We updated this cluster's cilium to 1.15.1:

$ k exec -it -n kube-system cilium-nsvzl -- cilium version
Defaulted container "cilium-agent" out of: cilium-agent, config (init), mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init), cleanup-kube-proxy-iptables (init), install-cni-binaries (init)
Client: 1.15.1 a368c8f0 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64
Daemon: 1.15.1 a368c8f0 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64

Now we're noticing that one of the cilium agents on this cluster has higher cilium_policy_* map consumption:

[screenshot]

Logs are also full of:

cilium-agent level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager

Does that ring any bell?

kopiczko commented on September 26, 2024

The msg="regenerating all endpoints" reason="one or more identities created or deleted" log is probably not the right thing to chase, because it looks similar on all agents:

[screenshot]

kopiczko commented on September 26, 2024

Any idea why one agent would have significantly higher cilium_policy_* pressure than the others? This node doesn't even run more pods than the other nodes in the cluster.

kopiczko commented on September 26, 2024

Thanks. Can you tell me what USERS means in the output of cilium policy selectors?

I noticed there is this selector &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{},} (empty?) in the output, and it has 5.5k identities. Is this normal?

from cilium.

christarazi commented on September 26, 2024

That selector says that it's selecting all identities in the cluster. Empty selector = all. Which would explain why there's policymap pressure, as selecting all identities would mean a policymap entry per identity.
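
(For illustration, one hypothetical rule shape that produces such an empty, select-all source selector; the policy name and labels below are made up:)

apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-ingress-from-anything   # hypothetical policy
spec:
  endpointSelector:
    matchLabels:
      app: example                    # hypothetical workload label
  ingress:
  - fromEndpoints:
    - {}                              # empty selector: matches every identity,
                                      # i.e. one policymap entry per identity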

kopiczko commented on September 26, 2024

Thanks @christarazi. Is there a straightforward way to tell which policy this selector is generated for?

Also, can you tell me what USERS means in the output of cilium policy selectors?

christarazi commented on September 26, 2024

Thanks @christarazi. Is there a straightforward way to tell which policy this selector is generated for?

That command already tells you which policy the selector comes from, under the LABELS column.

Also, can you tell me what USERS means in the output of cilium policy selectors?

USERS is a bit of low-level information, but IIRC it tracks the number of rules that "use" the selector, i.e. that it applies to.

kopiczko commented on September 26, 2024

Hmm, the LABELS column is completely empty in my case...

kopiczko commented on September 26, 2024

$ k exec -it -n kube-system cilium-s9jtf -c cilium-agent -- cilium policy selectors -v | head -n 4
SELECTOR                                                                                                                                                                                  LABELS   USERS   IDENTITIES
&LabelSelector{MatchLabels:map[string]string{any.app.kubernetes.io/name: loki,k8s.io.kubernetes.pod.namespace: kube-system,},MatchExpressions:[]LabelSelectorRequirement{},}                       6       
&LabelSelector{MatchLabels:map[string]string{any.k8s-app: coredns,k8s.io.kubernetes.pod.namespace: kube-system,},MatchExpressions:[]LabelSelectorRequirement{},}                                   6       17858      
                                                                                                                                                                                                           32450 

christarazi commented on September 26, 2024

You might be running a version where that part of the command broke, but the most recent version works fine for me.

kopiczko commented on September 26, 2024

It's 1.15.1:

$ k exec -it -n kube-system cilium-s9jtf -c cilium-agent -- cilium version
Client: 1.15.1 a368c8f0 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64
Daemon: 1.15.1 a368c8f0 2024-02-14T22:16:57+00:00 go version go1.21.6 linux/amd64

kopiczko commented on September 26, 2024
3515    &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{LabelSelectorRequirement{Key:k8s.io.kubernetes.pod.namespace,Operator:Exists,Values:[],},},}
3515    &LabelSelector{MatchLabels:map[string]string{k8s.io.cilium.k8s.policy.cluster: default,},MatchExpressions:[]LabelSelectorRequirement{},}
3553    &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{},}

Do you know what &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{LabelSelectorRequirement{Key:k8s.io.kubernetes.pod.namespace,Operator:Exists,Values:[],},},} and &LabelSelector{MatchLabels:map[string]string{k8s.io.cilium.k8s.policy.cluster: default,},MatchExpressions:[]LabelSelectorRequirement{},} are?

christarazi commented on September 26, 2024
3515    &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{LabelSelectorRequirement{Key:k8s.io.kubernetes.pod.namespace,Operator:Exists,Values:[],},},}
3515    &LabelSelector{MatchLabels:map[string]string{k8s.io.cilium.k8s.policy.cluster: default,},MatchExpressions:[]LabelSelectorRequirement{},}
3553    &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{},}

Do you know what &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{LabelSelectorRequirement{Key:k8s.io.kubernetes.pod.namespace,Operator:Exists,Values:[],},},} and &LabelSelector{MatchLabels:map[string]string{k8s.io.cilium.k8s.policy.cluster: default,},MatchExpressions:[]LabelSelectorRequirement{},} are?

The first looks like a selector that selects anything with a namespace label. The second looks like a selector that selects anything inside the default cluster. Both of these I would guess select all identities because every pod has a namespace label and every pod has a default cluster label.

github-actions commented on September 26, 2024

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

kopiczko commented on September 26, 2024

Seems like the issue went away after upgrading to 1.15.
