
Comments (9)

ShylajaDevadiga commented on August 17, 2024

Additional info:
While joining a second node using the binary, etcdserver reports a timeout. Typically we need to run the binary again to join the other nodes.

# kubectl get nodes
Error from server: etcdserver: request timed out
# kubectl get nodes
Error from server: etcdserver: leader changed
root@ip-172-31-14-93:~# 

Logs:

E0709 16:15:08.373041    2003 leaderelection.go:320] error retrieving resource lock kube-system/rke2: Get "https://127.0.0.1:6443/api/v1/namespaces/kube-system/configmaps/rke2": context deadline exceeded
I0709 16:15:08.373089    2003 leaderelection.go:277] failed to renew lease kube-system/rke2: timed out waiting for the condition
E0709 16:15:08.373141    2003 leaderelection.go:296] Failed to release lock: resource name may not be empty
FATA[0342] leaderelection lost for rke2 

from rke2.

briandowns commented on August 17, 2024

Please test again with release alpha.7. This issue appears to be the result of a previous bug that's been squashed.


ShylajaDevadiga commented on August 17, 2024

@briandowns Tested on the alpha7 release; it does not seem to have fixed the issue. Neither a master nor an agent is able to join the first node.


briandowns commented on August 17, 2024

Interesting. We'll have to go over this tomorrow, as it worked fine with the same commands.


briandowns commented on August 17, 2024

Part of the issue causing this was environment variable use in the k3s code, which was patched yesterday. There's a continuing issue with etcd in regard to raft that's currently being troubleshot. A second master can be listed on both nodes; however, quorum is lost and leader election fails.


ShylajaDevadiga commented on August 17, 2024

Using rke2 version v1.18.4-alpha9-rke2, the second node is unable to join the master.
Commands used:
Node1:

INSTALL_RKE2_VERSION=v1.18.4-alpha9+rke2 ./install.sh

Node2:
INSTALL_RKE2_VERSION=v1.18.4-alpha9+rke2 RKE2_URL=https://Node1_IP:9345 RKE2_TOKEN= INSTALL_RKE2_EXEC='server' ./install.sh

Adding Node configuration:
Node OS: Ubuntu 20.04

free -h
              total        used        free      shared  buff/cache   available
Mem:          7.8Gi       1.0Gi       1.9Gi       2.0Mi       4.8Gi       6.7Gi
df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        20G  5.5G   14G  29% /




galal-hussein commented on August 17, 2024

Main issue

When a new server is added to a one-node cluster, the primary node exits with the following error:

Jul 14 19:52:08 ip-172-31-18-137 rke2[975415]: E0714 19:52:08.236601  975415 leaderelection.go:320] error retrieving resource lock kube-system/rke2: etcdserver: request timed out
Jul 14 19:52:23 ip-172-31-18-137 rke2[975415]: E0714 19:52:23.234378  975415 leaderelection.go:320] error retrieving resource lock kube-system/rke2: Get "https://127.0.0.1:6443/api/v1/namespaces/kube-system/configmaps/rke2": context deadline exceeded
Jul 14 19:52:23 ip-172-31-18-137 rke2[975415]: I0714 19:52:23.234430  975415 leaderelection.go:277] failed to renew lease kube-system/rke2: timed out waiting for the condition
Jul 14 19:52:23 ip-172-31-18-137 rke2[975415]: E0714 19:52:23.234501  975415 leaderelection.go:296] Failed to release lock: resource name may not be empty
Jul 14 19:52:23 ip-172-31-18-137 rke2[975415]: time="2020-07-14T19:52:23Z" level=fatal msg="leaderelection lost for rke2"

Root Cause

After debugging we pinned down the issue to the following:

1- When the first node starts, it enters the leaderelection thread https://github.com/rancher/k3s/blob/master/pkg/server/server.go#L152 with a callback that starts the server controllers once it starts leading.

2- When a second server node starts, it adds itself as an etcd member and then starts the etcd pod.

3- The time gap between adding itself as a member and starting the etcd server causes the etcd cluster to lose quorum (since the cluster has two members and one is not ready).

4- In the meantime, the leaderelection thread on the first node tries to lock the resource for leaderelection every two seconds, https://github.com/rancher/k3s/blob/f7dae176e9fe03aece1076f53d86eef298436e4e/vendor/github.com/rancher/wrangler/pkg/leader/leader.go#L55 and it fails because the k8s cluster is down, so it exits with the leader election error.
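
To make step 3 concrete, here is a minimal Go sketch of the majority arithmetic, assuming the usual Raft rule that quorum is a majority of the voting members. The function names `quorum` and `hasQuorum` are made up for illustration and do not come from the k3s or etcd code.

```go
package main

import "fmt"

// quorum returns how many voting members must be up for a Raft
// cluster with the given number of voting members to make progress.
// (Hypothetical helper, for illustration only.)
func quorum(votingMembers int) int {
	return votingMembers/2 + 1
}

// hasQuorum reports whether enough voting members are started.
func hasQuorum(votingMembers, startedVoters int) bool {
	return startedVoters >= quorum(votingMembers)
}

func main() {
	// One-node cluster before the join: 1 voter, 1 started.
	fmt.Println(hasQuorum(1, 1)) // true

	// Second server has registered itself as a voting member,
	// but its etcd has not started yet: 2 voters, 1 started.
	fmt.Println(hasQuorum(2, 1)) // false
}
```

With two voting members the majority is two, so the single running member cannot serve requests until the new one is up. That window is where the first node's lease renewal times out.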

Suggested solution

Take advantage of the learner feature in etcd 3.4, which lets you add a member as a learner (a non-voting member) and then promote it to a voting member after it starts. The joining logic will then be:

  • add member as a learner
  • start etcd
  • make sure it's healthy
  • promote it to a voting member

This will avoid the cluster being in a down state when adding a new member.
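
The four steps above could be sketched roughly as follows. This is not the actual k3s/rke2 change: the `Cluster` interface, `JoinAsLearner`, and `fakeCluster` are hypothetical names introduced only to show the ordering. A real implementation would presumably go through the etcd Go client, which provides `MemberAddAsLearner` and `MemberPromote` for the first and last steps, and would wait between health checks rather than loop immediately.

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is a hypothetical abstraction over the operations the
// joining node needs; it is not an etcd or k3s type.
type Cluster interface {
	AddLearner(peerURL string) (uint64, error)
	StartLocalEtcd() error
	Healthy(id uint64) (bool, error)
	Promote(id uint64) error
}

// JoinAsLearner adds the node as a non-voting learner, starts etcd,
// polls health up to maxChecks times, and promotes only once healthy.
func JoinAsLearner(c Cluster, peerURL string, maxChecks int) error {
	id, err := c.AddLearner(peerURL)
	if err != nil {
		return fmt.Errorf("add learner: %w", err)
	}
	if err := c.StartLocalEtcd(); err != nil {
		return fmt.Errorf("start etcd: %w", err)
	}
	for i := 0; i < maxChecks; i++ {
		ok, err := c.Healthy(id)
		if err != nil {
			return fmt.Errorf("health check: %w", err)
		}
		if ok {
			return c.Promote(id)
		}
		// A real implementation would sleep or back off here.
	}
	return errors.New("learner never became healthy; not promoted")
}

// fakeCluster is an in-memory stand-in that records the call order
// and reports healthy after a fixed number of checks.
type fakeCluster struct {
	healthyAfter int
	checks       int
	steps        []string
}

func (f *fakeCluster) AddLearner(peerURL string) (uint64, error) {
	f.steps = append(f.steps, "add-learner")
	return 1, nil
}

func (f *fakeCluster) StartLocalEtcd() error {
	f.steps = append(f.steps, "start-etcd")
	return nil
}

func (f *fakeCluster) Healthy(id uint64) (bool, error) {
	f.steps = append(f.steps, "health-check")
	f.checks++
	return f.checks >= f.healthyAfter, nil
}

func (f *fakeCluster) Promote(id uint64) error {
	f.steps = append(f.steps, "promote")
	return nil
}

func main() {
	f := &fakeCluster{healthyAfter: 2}
	err := JoinAsLearner(f, "https://node2.example:2380", 5)
	fmt.Println(err, f.steps)
}
```

Because the learner is not a voting member, the existing node keeps its majority of one during the whole sequence, and the promotion only happens once the new member is known to be up.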

cc @briandowns


davidnuzik commented on August 17, 2024

We're going to move forward with the suggested solution. We got the OK from Craig and this is essentially in sync with recommendations in etcd docs.


rancher-max commented on August 17, 2024

Validated with alpha22 and beta1

  • In both rke2 versions, I am able to join a second node with no issues. It comes up successfully as an etcd node after becoming ready and all pods are in Running state.
  • This still works correctly when setting a few different runtime options (--token, --tls-san, and cis 1.5 mode)

