Comments (10)

youngnick commented on July 22, 2024

Okay, I've found the problem - it's that, in some cases, the CiliumEnvoyConfig reconciliation happens before the Service has been added into Cilium's stores, so the Nodeports can't be determined. Verified with a terrible delay patch, and talked to @joamaki and have figured out a way to fix this properly.

Folks affected by this error, it's been reverted on main, so a rebase should clear this up until I can get a proper fix in.

from cilium.

youngnick commented on July 22, 2024

Ah yes, I think I fixed this with #33266 . Essentially, what was happening was that we would create the CiliumEnvoyConfig first, then the Service that it was referencing, so in the constrained CI environment, sometimes that would not happen fast enough for the CEC reconciliation to have seen the Service - and adding Nodeport checks into the CEC handling introduced a dependency.

So, I moved the Service creation ahead of the CEC creation, which caught most issues. The remaining cases are handled by Hive retrying on processing errors, so all I had to do there was change the Error-level logs that were being generated to Info (since reconciliation failures are most likely transient). Hive will retry reconciliation failures periodically until they succeed.
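
The retry behaviour described above can be illustrated with a small shell sketch. This is not Hive's implementation, only the general retry-until-success pattern under stated assumptions: `reconcile_cec`, `retry_until_ok`, and the `SERVICE_STORE` file are hypothetical stand-ins for the CEC reconciler, the retry loop, and the agent's service store.

```shell
#!/bin/bash
# Illustration only: a reconcile step that depends on the Service being known,
# wrapped in a generic retry loop. All names here are hypothetical.

# Hypothetical file standing in for the agent's service store (one name per line).
SERVICE_STORE=${SERVICE_STORE:-/tmp/service-store.txt}

# Fails (transiently) until the Service name appears in the store.
reconcile_cec() {
    local svc=$1
    if ! grep -qxF "$svc" "$SERVICE_STORE" 2>/dev/null; then
        # Logged at info level: the failure is expected to be transient.
        echo "info: service $svc not found yet, will retry" >&2
        return 1
    fi
    echo "reconciled CEC for $svc"
}

# Run a command up to max_attempts times, sleeping delay seconds between tries.
retry_until_ok() {
    local max_attempts=$1 delay=$2 attempt
    shift 2
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        "$@" && return 0
        sleep "$delay"
    done
    return 1
}
```

For example, `retry_until_ok 5 1 reconcile_cec cilium-ingress-same-node` would keep trying once a second until the Service name shows up in the store or five attempts have been used.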

youngnick commented on July 22, 2024

So, in the change I added, CiliumEnvoyConfig processing is changed so that, if we are adding L7 proxy port forwarding for a Service that has Nodeports, we forward both the frontend port and the Nodeport to the proxy. (This is because the services in question are the special LB services for Ingress that don't do anything other than forward traffic to Envoy).

In all of the failures I've checked, at least one node does not have the ProxyPort rules visible in cilium bpf lb list. So, when the connectivity test checks that the node responds to the Ingress on that Nodeport, it fails.

In most cases, the ProxyPort rules never show up for that node for the lifetime of the test (they're not visible in the final sysdump, nor in any of the ones along the way). In one case, the rules were not present on one node in the first sysdump, then they showed up later.

So, it seems like this is caused by a node failing to apply the ProxyPort forwarding correctly for a single service. Which smells like an ordering-of-received-Kubernetes-updates issue to me. Going to ask for help from sig-datapath on this one.

youngnick commented on July 22, 2024

I wrote a script to search through sysdumps, find the nodePort values for the cilium-ingress-same-node and cilium-ingress-other-node Services, and then grep the cilium bpf lb list output for those values:

#!/bin/bash
# For each sysdump, print the cilium bpf lb list entries that mention the
# NodePorts of the two Ingress Services.

SYSDUMP_DIRS=$(find . -type d -name "cilium-sysdump-*" | sort)

for d in $SYSDUMP_DIRS; do
    pushd "$d" || exit
    echo "Moving into $d"

    # Service manifest captured in the sysdump.
    SVC_FILE=$(find . -name "k8s-service*.yaml")
    SAME_NODEPORT=$(yq '.items | map(select(.metadata.name == "cilium-ingress-same-node"))[0] | .spec.ports[0].nodePort' "$SVC_FILE")
    OTHER_NODEPORT=$(yq '.items | map(select(.metadata.name == "cilium-ingress-other-node"))[0] | .spec.ports[0].nodePort' "$SVC_FILE")
    echo "================= Same Nodeport: $SAME_NODEPORT, Other Nodeport $OTHER_NODEPORT"

    # One bugtool directory per Cilium agent pod.
    for ciliumpod in $(find . -type d -name 'cilium*'); do
        echo "$ciliumpod"
        grep -E "$SAME_NODEPORT|$OTHER_NODEPORT" "${ciliumpod}/cmd/cilium-dbg-bpf-lb-list.md"
    done

    popd || exit
done

Representative output:

/Users/ynick/src/isovalent/issues/nodeport-ci/cilium-sysdumps-2/cilium-sysdump-12-20240618-052548 /Users/ynick/src/isovalent/issues/nodeport-ci
Moving into ./cilium-sysdumps-2/cilium-sysdump-12-20240618-052548
================= Same Nodeport: 32749, Other Nodeport 31088
./cilium-bugtool-cilium-s5xzv-20240618-052548
0.0.0.0:32749 (0)             0.0.0.0:0 (42) (0) [NodePort, non-routable]
192.168.0.2:32749 (0)         0.0.0.0:0 (41) (0) [NodePort]
172.18.0.2:32749 (0)          0.0.0.0:0 (40) (0) [NodePort]
192.168.0.2:31088 (0)         0.0.0.0:0 (37) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 11507)
0.0.0.0:31088 (0)             0.0.0.0:0 (35) (0) [NodePort, non-routable, l7-load-balancer] (L7LB Proxy Port: 11507)
172.18.0.2:31088 (0)          0.0.0.0:0 (36) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 11507)
./cilium-bugtool-cilium-qlq76-20240618-052548
0.0.0.0:31088 (0)             0.0.0.0:0 (37) (0) [NodePort, non-routable, l7-load-balancer] (L7LB Proxy Port: 12362)
192.168.0.4:32749 (0)         0.0.0.0:0 (41) (0) [NodePort]
0.0.0.0:32749 (0)             0.0.0.0:0 (42) (0) [NodePort, non-routable]
172.18.0.3:32749 (0)          0.0.0.0:0 (40) (0) [NodePort]
172.18.0.3:31088 (0)          0.0.0.0:0 (35) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 12362)
192.168.0.4:31088 (0)         0.0.0.0:0 (36) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 12362)
./cilium-bugtool-cilium-nhwl9-20240618-052548
192.168.0.3:32749 (0)         0.0.0.0:0 (41) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 19370)
192.168.0.3:31088 (0)         0.0.0.0:0 (36) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 19317)
172.18.0.4:31088 (0)          0.0.0.0:0 (35) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 19317)
172.18.0.4:32749 (0)          0.0.0.0:0 (40) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 19370)
0.0.0.0:31088 (0)             0.0.0.0:0 (37) (0) [NodePort, non-routable, l7-load-balancer] (L7LB Proxy Port: 19317)
0.0.0.0:32749 (0)             0.0.0.0:0 (42) (0) [NodePort, non-routable, l7-load-balancer] (L7LB Proxy Port: 19370)
/Users/ynick/src/isovalent/issues/nodeport-ci
/Users/ynick/src/isovalent/issues/nodeport-ci/cilium-sysdumps-2/cilium-sysdump-12-final /Users/ynick/src/isovalent/issues/nodeport-ci
Moving into ./cilium-sysdumps-2/cilium-sysdump-12-final
================= Same Nodeport: 32749, Other Nodeport 31088
./cilium-bugtool-cilium-s5xzv-20240618-053904
0.0.0.0:32749 (0)             0.0.0.0:0 (42) (0) [NodePort, non-routable]
0.0.0.0:31088 (0)             0.0.0.0:0 (35) (0) [NodePort, non-routable, l7-load-balancer] (L7LB Proxy Port: 11507)
192.168.0.2:31088 (0)         0.0.0.0:0 (37) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 11507)
192.168.0.2:32749 (0)         0.0.0.0:0 (41) (0) [NodePort]
172.18.0.2:31088 (0)          0.0.0.0:0 (36) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 11507)
172.18.0.2:32749 (0)          0.0.0.0:0 (40) (0) [NodePort]
./cilium-bugtool-cilium-qlq76-20240618-053904
0.0.0.0:31088 (0)             0.0.0.0:0 (37) (0) [NodePort, non-routable, l7-load-balancer] (L7LB Proxy Port: 12362)
192.168.0.4:31088 (0)         0.0.0.0:0 (36) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 12362)
192.168.0.4:32749 (0)         0.0.0.0:0 (41) (0) [NodePort]
172.18.0.3:32749 (0)          0.0.0.0:0 (40) (0) [NodePort]
172.18.0.3:31088 (0)          0.0.0.0:0 (35) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 12362)
0.0.0.0:32749 (0)             0.0.0.0:0 (42) (0) [NodePort, non-routable]
./cilium-bugtool-cilium-nhwl9-20240618-053904
172.18.0.4:32749 (0)          0.0.0.0:0 (40) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 19370)
192.168.0.3:32749 (0)         0.0.0.0:0 (41) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 19370)
172.18.0.4:31088 (0)          0.0.0.0:0 (35) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 19317)
0.0.0.0:32749 (0)             0.0.0.0:0 (42) (0) [NodePort, non-routable, l7-load-balancer] (L7LB Proxy Port: 19370)
0.0.0.0:31088 (0)             0.0.0.0:0 (37) (0) [NodePort, non-routable, l7-load-balancer] (L7LB Proxy Port: 19317)
192.168.0.3:31088 (0)         0.0.0.0:0 (36) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 19317)

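You can see in these examples that port 32749 has no L7LB Proxy Port forwards on two of the three nodes (cilium-s5xzv and cilium-qlq76); only cilium-nhwl9 has them, and that is still the case in the final sysdump.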
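
A minimal sketch of how that check could be automated, assuming the cilium bpf lb list line format shown in the output above. The function name `ports_missing_proxy` is hypothetical; it reads lb list output on stdin and prints each port passed as an argument that has at least one frontend entry without an L7LB Proxy Port.

```shell
#!/bin/bash
# Hypothetical helper: for each port given as an argument, print the port if the
# lb list text on stdin contains a frontend line for it that lacks "L7LB Proxy Port".
ports_missing_proxy() {
    local input port
    input=$(cat)
    for port in "$@"; do
        # Frontend lines are assumed to start with "<ipv4>:<port> (".
        # grep -v succeeds only if at least one such line has no proxy port.
        if printf '%s\n' "$input" \
            | grep -E "^[0-9.]+:${port} \(" \
            | grep -qv 'L7LB Proxy Port'; then
            echo "$port"
        fi
    done
}
```

Run against a single agent's dump, for example `ports_missing_proxy 32749 31088 < cmd/cilium-dbg-bpf-lb-list.md`, it would print 32749 for the cilium-s5xzv and cilium-qlq76 nodes above and nothing for cilium-nhwl9.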

thorn3r commented on July 22, 2024

Hit on https://github.com/cilium/cilium/actions/runs/9598146987/job/26468789478
PR: #33240
sysdump: cilium-sysdump-15-final.zip

jrajahalme commented on July 22, 2024

@youngnick We handle the same problem with L7 service redirection by registering the redirection without assuming the service is already there. Maybe we could do something similar for the ports, allowing the service package to deal with the actual port number when the service appears?

youngnick commented on July 22, 2024

I looked into that, but it's not really practical as the service package doesn't hold the details we'd need to look up. I should be able to work up a PR on Monday my time, so I'll have something to show you early next week.

joestringer commented on July 22, 2024

@youngnick how did that investigation go? Are next steps clear?

youngnick commented on July 22, 2024

So I believe this should be fixed now.

joestringer commented on July 22, 2024

Alright, let's close it out. We can always reopen if it turns out the fix wasn't enough.
