Giter Club home page Giter Club logo

k3s-ha-cluster's Introduction

ha-k3s-cluster

I thought I'd share how I set up my Kubernetes personal playground (a WIP), and share my personal experience with the OVH Incident. Hopefully this might be useful for someone, and it gives me a chance to contribute back to the CNCN community.


Shortcuts


My starting cluster layout

As I'm on a tight cloud budget (after some unfortunate accidents 💲 💲), I usually go with the minimum 2 control nodes and 4~5 worker nodes, an HA external Postgres DB for k3s, an external LB (Nginx or HAproxy, I don't need HA LB for now), and a control node to serve as rebound for ssh and admin stuff. I usually try to spread those nodes across multiple providers / locations within the constraints of my budget.

Architecture Layout


Manual cluster install

K3s is currently my by-default option for kubernetes, I mainly rely on Rancher documentation to get going, with some adjustment that suit my needs. I summed up my step by step install here.


Ansible cluster install

I eventually moved to a faster more streamlined way of installing a cluster with ansible, using this ansible-role as base (I add custom stuff), and to save you some time here is the playbook i run it with.

---

- name: Build k3s cluster
  hosts: k3s_cluster
  vars:
    k3s_state: installed
    k3s_release_version: v1.19
    k3s_build_cluster: true
    k3s_become_for_all: true
    k3s_control_node_address: "{{k3s_loadbalancer_ip}}" # loadbalancer
    k3s_server:
      datastore-endpoint: "postgres://k3s:{{k3s_postgres_password}}@{{k3s_postgres_hostname}}:5432/k3s" # add ?sslmode=disable to disable ssl
      with-node-id: false
      docker: false
      write-kubeconfig-mode: "0644"
      node-taint:
        - "CriticalAddonsOnly=true:NoExecute" # prevent control nodes from acting as workers
      tls-san:
        - "{{k3s_loadbalancer_ip}}" # important to add LB ip to kubeconfig certificate
        # you can add other ip addreses
      disable:
        - traefik # to disable traefik and install it manually (if you want to use Traefik V2)
    k3s_agent:
      with-node-id: false
  roles:
    - role: k3s # the role mentioned above

You need to feed it an inventory that looks like this:

---
k3s_control_nodes:
  hosts:
    control-1:
      ansible_user: <ansible_user>
      ansible_host: <host_ip>
      k3s_control_node: true
    control-2:
      ansible_user: <ansible_user>
      ansible_host: <host_ip>
      k3s_control_node: true

k3s_worker_nodes:
  hosts:
    worker-1:
      ansible_user: <ansible_user>
      ansible_host: <host_ip>
      k3s_control_node: false
    worker-2:
      ansible_user: <ansible_user>
      ansible_host: <host_ip>
      k3s_control_node: false
      
k3s_cluster:
  hosts:
    control-1:
    control-2:
    worker-1:
    worker-2:
    worker-3:

External LB config

worker_processes 4;
worker_rlimit_nofile 40000;
include /etc/nginx/modules-enabled/*.conf;

events {
    worker_connections 8192;
}

stream {
        upstream http_servers {
                least_conn;
                server <NODE_IP>:80 max_fails=3 fail_timeout=5s;
                server <NODE_IP>:80 max_fails=3 fail_timeout=5s;
                server <NODE_IP>:80 max_fails=3 fail_timeout=5s;
                server <NODE_IP>:80 max_fails=3 fail_timeout=5s;
        }

        upstream https_servers {
                least_conn;
                server <NODE_IP>:443 max_fails=3 fail_timeout=5s;
                server <NODE_IP>:443 max_fails=3 fail_timeout=5s;
                server <NODE_IP>:443 max_fails=3 fail_timeout=5s;
                server <NODE_IP>:443 max_fails=3 fail_timeout=5s;
        }

        upstream k3s_servers {
                server <CONTROL_NODE_IP>:6443;
                server <CONTROL_NODE_IP>:6443;
        }

        server {
                listen 80;
                proxy_pass http_servers;
        }

        server {
                listen 443;
                proxy_pass https_servers;
        }

        server {
                listen 6443;
                proxy_pass k3s_servers;
        }
}

Installing Rancher, Traefik, Cert-Manager and Longhorn

Your can go for the documentation of each solution to install, or use my ansible playbook here. First customize the values in here.


OVH Incident

After the OVH data-center @Strasbourg burner down, 3 of my nodes went down with it, 2 workers and a db replica, so out of the 4 workers I had only 2 were up (as they were hosted @Gravelin datacenter).


My first mistake was to put 2 nodes at Strasbourg and 2 at Gravelin, I should have gone for a different location for each node, since my worker nodes were also my data storage nodes. That being said, it took kubernetes some time to recover. I haven’t yet taken time to look into the k8s docs, to understand how long it waits for nodes to recover before it tries to reschedule certain workloads on available nodes. Deleting the nodes manually usually helps speed up the process.


But most of my workloads that relied on data persistence were struggling to go back up.


My second mistake was not having an odd number of worker nodes. The default longhorn replica count is 3, with only 2 available nodes the volumes were failing to mount.


Thankfully longhorn has an option allowing scheduling on the same node.


After that was ticked the missing replicas were recreated, and my workloads were back up. Not losing any data in the process.



I eventually provided 2 new worker nodes and disabled the Soft anti-affinity option, and got back to a normal setup. I also do have a safety option that is the backup to aws S3 provided by Longhorn. Which can be tricky to do, when restoring to new nodes but definitely a must-have, since S3 storage is rather cheap. I'm currently looking into the Rancher backup option as well as Velero for overkill.

k3s-ha-cluster's People

Contributors

zcubbs avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

k3s-ha-cluster's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.