
Comments (10)

dlapcevic commented on September 27, 2024 4

Hi, it looks like an oversight. The PodStore in cilium-operator is not initialized while it's still used for IPAM through ciliumNodeSynchronizer.

Now cilium-operator uses resource.Resource[*v1.Pod] and it wouldn't make sense to initialize another watcher.

Before migrating ciliumNodeSynchronizer to a cell, we could add this to fix the issue. It will use the PodStore from the Pod resource that is always initialized on cilium-operator start, as part of the operatorK8s.ResourcesCell:

podStore, err := legacy.resources.Pods.Store(legacy.ctx)
if err != nil {
	log.Fatalf("Pod Store: %v", err)
}
operatorWatchers.PodStore = podStore.CacheStore()
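To illustrate why the missing initialization matters, here is a minimal, self-contained sketch of the failure mode. It does not use Cilium's real types: `Store`, `Pod`, `sliceStore`, `podStore`, and `pendingPodCount` are all hypothetical stand-ins, and the only thing taken from this thread is the "pod store uninitialized" error text. It assumes the pending-pod lookup reads from a package-level store that has to be wired once at startup.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod and Store are hypothetical stand-ins, not Cilium or client-go types.
type Pod struct {
	Node  string
	Phase string
}

type Store interface {
	List() []Pod
}

// sliceStore is a trivial in-memory Store used only for this sketch.
type sliceStore []Pod

func (s sliceStore) List() []Pod { return s }

// podStore mimics a package-level store that must be assigned at startup.
var podStore Store

var errUninitialized = errors.New("pod store uninitialized")

// pendingPodCount counts Pending pods on a node, or fails if the store
// was never wired up.
func pendingPodCount(node string) (int, error) {
	if podStore == nil {
		return 0, errUninitialized
	}
	n := 0
	for _, p := range podStore.List() {
		if p.Node == node && p.Phase == "Pending" {
			n++
		}
	}
	return n, nil
}

func main() {
	// Before wiring: every lookup fails, so the caller cannot surge-allocate.
	_, err := pendingPodCount("node-a")
	fmt.Println("before wiring:", err)

	// After wiring (the role played by the assignment in the fix above).
	podStore = sliceStore{
		{Node: "node-a", Phase: "Pending"},
		{Node: "node-a", Phase: "Running"},
		{Node: "node-b", Phase: "Pending"},
	}
	n, err := pendingPodCount("node-a")
	fmt.Println("after wiring:", n, err)
}
```

In this sketch the lookup itself is correct; it simply never succeeds until something assigns the store, which is the kind of gap the one-time assignment in the fix is meant to close.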

I opened a PR with the fix. #34090


vflaux commented on September 27, 2024 2

tl;dr: it just delays the allocation of IPs when a node suddenly needs a lot of IPs (16, for example).

If a node needs more IPs than have been pre-allocated (8 by default), it has to wait for the next reconciliation cycle (every minute) for the cilium operator to try to allocate more IPs. As I understand the code, this is the expected behavior so far.
When the reconciliation occurs, the controller tries to list the pending pods for the node to know how many IPs it needs to allocate (called "surge allocation" in the code). Without it, the controller just asks to allocate the number of IPs needed for the node to satisfy its "pre-allocate" value. But this is based only on the number of IPs already allocated and actually in use. As a result, it will never ask to allocate more than this "pre-allocate" value per reconciliation cycle (when "allocated" equals "in use").
This is not a problem until the node needs more than 2x the "pre-allocate" value. If it needs more, the allocation of the next batch of IPs is just delayed until the next reconciliation (one minute).
It most likely does not affect many real-world clusters, which could explain why this has not been reported since 1.15.0.
I found this issue in a testing environment on EKS by scheduling 80 pods onto a node at once. Cilium took minutes to allocate all the IPs needed, which surprised me.


vflaux commented on September 27, 2024 1

Hi, I could, but the code hasn't changed in the v1.15 branch or in main, so I'm pretty sure the result is the same.
You should see this log systematically since Cilium 1.15.

Edit:
With Cilium 1.15.5:

time="2024-05-30T10:44:20Z" level=info msg="Discovered new CiliumNode custom resource" name=ip-xxx-xxx-xxx-xxx.xxx.compute.internal subsys=ipam
time="2024-05-30T10:44:20Z" level=warning msg="Unable to compute pending pods, will not surge-allocate" error="pod store uninitialized" instanceID=i-xxx name=ip-xxx-xxx-xxx-xxx.xxx.compute.internal subsys=ipam


Dhouti commented on September 27, 2024 1

Also encountering this issue on 1.15.6. Observed while rolling out a new AMI in EKS: some pods scheduled to a new node would go into CrashLoopBackOff for several minutes due to a lack of IPs, until the operator caught up.

I'm looking at enabling prefix delegation, which will likely work as a workaround for my specific case.


joestringer commented on September 27, 2024

Hi, can you confirm whether you experience this with the latest v1.15.x release? I see the same log mentioned in #31016, so I'm not sure whether it was already resolved in a newer Cilium bugfix release.


joestringer commented on September 27, 2024

OK, thanks for confirming. It looks like PodsInit was last in the code prior to 1a9a1e7 (#28233). I acknowledge it's no longer called in any v1.15.x released versions. @alan-kut do you know why this path is no longer called after the migration to Hive? Is this expected to be implicitly called somehow or was this an oversight?


alex-berger commented on September 27, 2024

Observing the same behavior with Cilium version 1.15.6. What negative effect does this have? What will not work if this happens?


alex-berger commented on September 27, 2024

@vflaux Thank you very much for that explanation, that saved me several hours or days of reading through the code :-)

I know "one minute" might not sound like much, but in our case, where we run (among other things) GitLab Runners (CI jobs) on Kubernetes, each second matters. Our clusters are optimized for fast Pod startup and fully automatic scaling using tools like Karpenter. So, fixing issues like this will make a difference :-)


alex-berger commented on September 27, 2024

I was hoping this would be fixed in 1.16.0, but it has not been fixed yet.

@joestringer Are there any plans to fix this?


bowei commented on September 27, 2024

Hi @dlapcevic -- looks like Alan has the context but he has moved on to other projects. Can you take a look?

