
sidero's Introduction

Sidero

Caution

Sidero Labs is no longer actively developing Sidero Metal. For an alternative, please see Omni. Unless you have an existing support contract covering Sidero Metal, all support will be provided by the community (including questions in our Slack workspace).

Kubernetes Bare Metal Lifecycle Management. Sidero Metal provides lightweight, composable tools that can be used to create bare-metal Talos + Kubernetes clusters. Sidero Metal is an open-source project from Sidero Labs.

Documentation

Visit the project site.

Compatibility with Cluster API and Kubernetes Versions

This provider's versions are compatible with the following versions of Cluster API:

v1alpha3 (v0.3), v1alpha4 (v0.4), v1beta1 (v1.x)

  • Sidero Provider (v0.5)
  • Sidero Provider (v0.6)

This provider's versions are able to install and manage the following versions of Kubernetes:

v1.19, v1.20, v1.21, v1.22, v1.23, v1.24, v1.25, v1.26, v1.27, v1.28, v1.29, v1.30

  • Sidero Provider (v0.5)
  • Sidero Provider (v0.6)

This provider's versions are compatible with the following versions of Talos:

v0.12, v0.13, v0.14, v1.0, v1.1, v1.2, v1.3, v1.4, v1.5, v1.6, v1.7

  • Sidero Provider (v0.5): ✓ (+) ✓ (+)
  • Sidero Provider (v0.6)

Support

Join our Slack!

sidero's People

Contributors

aarnaud, aleksi, andrewrynhard, bzub, charlie-haley, frezbo, hxtmdev, jjgadgets, khuedoan, ksawio97, kubebn, linuxmaniac, lion7, lukecarrier, martin-sweeny, nathandotleeathpe, ogkevin, oscr, rmvangun, rsmitty, rtrox, sflobbe, smira, steverfrancis, timjones, ulexus, unix4ever, utkuozdemir, vorburger, zbernstein68


sidero's Issues

bug: discovery image hangs on error

See the attached log:

[    3.537285] [sidero] Using "172.24.0.2:50100" as API endpoint
[    3.538304] [sidero] Reading SMBIOS
[   33.539339] [sidero] context deadline exceeded
[   33.540829] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[   33.542864] CPU: 1 PID: 1727 Comm: init Tainted: G        W         5.5.15-talos #1
[   33.544855] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
[   33.547043] Call Trace:
[   33.547859]  dump_stack+0x66/0x90
[   33.548829]  panic+0xfc/0x2b1
[   33.549710]  do_exit.cold.25+0x14/0xbb
[   33.550761]  do_group_exit+0x35/0x90
[   33.551785]  get_signal+0x134/0x810
[   33.552789]  do_signal+0x2b/0x590
[   33.553754]  ? __x64_sys_futex+0x137/0x174
[   33.554862]  ? __set_current_blocked+0x31/0x50
[   33.555955]  ? sigprocmask+0x6d/0x90
[   33.556881]  do_syscall_64+0x17a/0x3da
[   33.557874]  ? __do_page_fault+0x269/0x4a0
[   33.558908]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   33.560111] RIP: 0033:0x46b8a3
[   33.560844] Code: 24 20 c3 cc cc cc cc 48 8b 7c 24 08 8b 74 24 10 8b 54 24 14 4c 8b 54 24 18 4c 8b 44 24 20 44 8b 4c 24 28 b8 ca 00 00 00 0f 05 <89> 44 24 c
[   33.563201] RSP: 002b:000000c0000fbda8 EFLAGS: 00000286 ORIG_RAX: 00000000000000ca
[   33.564286] RAX: fffffffffffffe00 RBX: 000000c00013a000 RCX: 000000000046b8a3
[   33.565200] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000000c00013a148
[   33.566169] RBP: 000000c0000fbdf0 R08: 0000000000000000 R09: 0000000000000000
[   33.567191] R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000003
[   33.568218] R13: 000000c0000ab500 R14: 00000000008a30ec R15: 0000000000000000
[   33.569235] Kernel Offset: 0x7600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   33.570712] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100 ]---

Not sure why it can't talk back, but the panic is weird, and we should reboot on panic automatically (e.g. by setting the panic= kernel argument so the node restarts after a kernel panic instead of hanging).

Use "host" instead of "server"

In the next version of Sidero (v0.2), we should rename "Server" to "Host" (still don't love it, and open to alternatives).

add test for sidero upgrades

We should add a test for upgrading from the last Sidero release to the Sidero built in the PR, to make sure we're not breaking upgrades.

machine Spec doesn't have any IP information

Sidero:

status:
    bootstrapReady: true
    conditions:
    - lastTransitionTime: "2020-09-18T16:13:09Z"
      status: "True"
      type: Ready
    - lastTransitionTime: "2020-09-18T16:08:15Z"
      status: "True"
      type: BootstrapReady
    - lastTransitionTime: "2020-09-18T16:13:09Z"
      status: "True"
      type: InfrastructureReady
    infrastructureReady: true
    lastUpdated: "2020-09-18T16:13:09Z"
    nodeRef:
      apiVersion: v1
      kind: Node
      name: pxe-2
      uid: e6c0e820-2287-4e8d-95a1-76c1e3a541de
    observedGeneration: 3
    phase: Running

while for AWS/GCP we have .status.addresses (a sketch of publishing addresses from the Sidero provider follows the example):

MASTER_IP=$(${KUBECTL} --kubeconfig /tmp/e2e/docker/kubeconfig get machine -o go-template --template='{{range .status.addresses}}{{if eq .type "ExternalIP"}}{{.address}}{{end}}{{end}}' ${FIRST_CP_NODE})
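A hedged sketch of how the Sidero provider could publish the address, assuming cluster-api's v1beta1 MachineAddress types and a hypothetical Addresses field on the MetalMachine status; per the CAPI contract, the value would then be copied into Machine.status.addresses:

package controllers

import clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"

// metalMachineStatus is a stand-in containing only the proposed field;
// the real MetalMachineStatus would gain an equivalent Addresses field.
type metalMachineStatus struct {
	Addresses []clusterv1.MachineAddress
}

// setAddresses records the server's IP on the MetalMachine status so it
// surfaces on the CAPI Machine, matching what the AWS/GCP providers expose.
func setAddresses(status *metalMachineStatus, serverIP string) {
	status.Addresses = []clusterv1.MachineAddress{
		{Type: clusterv1.MachineInternalIP, Address: serverIP},
	}
}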

Nodes start booting image when environment is created

As soon as the default environment is created, nodes start booting the kernel/initramfs from that environment, which is not expected to happen.

With Talos, luckily, nodes can't fetch a config from the metadata server, but it might lead to other consequences for different server types.

Locking for picking servers

I think some customers saw a case where a server got picked twice for two different CAPI machines. We should put a basic mutex around the selection code block to ensure this doesn't happen, since our infra provider runs 10 worker threads by default.
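A minimal, self-contained sketch of the idea; the Server type and in-use flag here are simplified stand-ins for the real resources:

package selection

import (
	"errors"
	"sync"
)

// Server is a stand-in for Sidero's Server resource with only the fields
// the sketch needs.
type Server struct {
	Name  string
	InUse bool
}

// mu guards the check-then-claim critical section across the provider's
// worker goroutines (10 by default).
var mu sync.Mutex

// PickServer returns the first free server and marks it in use before the
// lock is released, so two CAPI machines can never be matched to the same
// server.
func PickServer(servers []*Server) (*Server, error) {
	mu.Lock()
	defer mu.Unlock()

	for _, s := range servers {
		if !s.InUse {
			s.InUse = true
			return s, nil
		}
	}

	return nil, errors.New("no matching server available")
}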

Add phase status to servers

We could let users know a little more about the state of a server by adding a PHASE field (a sketch of the API follows the list):

  • Cleaning
  • PoweredOn
  • PoweredOff
  • Failed/Failing
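A sketch of what the API addition could look like; the type and constant names are illustrative, not the shipped API:

package v1alpha1

// ServerPhase describes the lifecycle phase of a Server.
type ServerPhase string

const (
	ServerPhaseCleaning   ServerPhase = "Cleaning"
	ServerPhasePoweredOn  ServerPhase = "PoweredOn"
	ServerPhasePoweredOff ServerPhase = "PoweredOff"
	ServerPhaseFailed     ServerPhase = "Failed"
)

// ServerStatus would gain a Phase field that the controllers keep up to
// date and that kubectl can surface as an extra printer column.
type ServerStatus struct {
	Phase ServerPhase `json:"phase,omitempty"`
	// ...existing status fields...
}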

test: deprovisioning

This requires support from Sidero.

We have indirect tests, but we probably want to verify that the disk actually gets wiped, etc.

Ensure deletion works properly

We've seen cases where deleting the cluster hangs indefinitely. We suspect we're missing some proper owner references here.

Better handling of errors and reconciliation

We don't need to return the errors encountered in CAPI providers. Instead, we should just log the error and return ctrl.Result{}, nil: if the resource status is not ready, another reconcile gets triggered anyway. This pattern exists in all of our providers, so we'll need to clean it up everywhere.
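A minimal sketch of the proposed pattern, assuming a controller-runtime reconciler; the Reconciler type and the reconcileResource helper are stand-ins:

package controllers

import (
	"context"

	"github.com/go-logr/logr"
	ctrl "sigs.k8s.io/controller-runtime"
)

// Reconciler stands in for any of the provider's reconcilers.
type Reconciler struct {
	Log logr.Logger
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := r.reconcileResource(ctx, req); err != nil {
		// Log instead of returning the error: if the resource status is not
		// ready, another reconcile is triggered anyway, and we avoid the
		// noisy error events and exponential backoff from controller-runtime.
		r.Log.Error(err, "reconcile failed", "request", req.NamespacedName)

		return ctrl.Result{}, nil
	}

	return ctrl.Result{}, nil
}

// reconcileResource holds the actual reconciliation logic.
func (r *Reconciler) reconcileResource(ctx context.Context, req ctrl.Request) error {
	return nil
}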

generated talosconfig doesn't contain endpoints

Not sure if this should be an issue for the bootstrap controller project, but I'll report it here for easier access.

What I see in the talosconfig resource:

  status:
    dataSecretName: management-cluster-cp-46k74-bootstrap-data
    ready: true
    talosConfig: |
      context: management-cluster
      contexts:
        management-cluster:
          target: ""
          ca: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJIekNCMHFBREFnRUNBaEIxemJ3OTliT3pPT0U4OUFLMU5tVkFNQVVHQXl0bGNEQVFNUTR3REFZRFZRUUsKRXdWMFlXeHZjekFlRncweU1EQTVNVGd4TmpBNE1UQmFGdzB6TURBNU1UWXhOakE0TVRCYU1CQXhEakFNQmdOVgpCQW9UQlhSaGJHOXpNQ293QlFZREsyVndBeUVBR3J5VmU1SGVsVU11MTRwR2FFZ0kyVzJEdGRyV0E0d0dla2NGCjBsbkNOOVdqUWpCQU1BNEdBMVVkRHdFQi93UUVBd0lDaERBZEJnTlZIU1VFRmpBVUJnZ3JCZ0VGQlFjREFRWUkKS3dZQkJRVUhBd0l3RHdZRFZSMFRBUUgvQkFVd0F3RUIvekFGQmdNclpYQURRUUQzYlkzMkVqSnprUW9kWTZqVAo2Wlg5TXRubkdrM2dnMS9GaVd4cmEvNU96R1RnQXdhK3VzS1ZybEdpaDFjSnZzd3VxWHZyZVRScHh4N0xnU2hqClBCUUQKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
          crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJHakNCemFBREFnRUNBaEF4WkJCL2orbEVzYkR6V1FINTd2RXNNQVVHQXl0bGNEQVFNUTR3REFZRFZRUUsKRXdWMFlXeHZjekFlRncweU1EQTVNVGd4TmpBNE1UQmFGdzB6TURBNU1UWXhOakE0TVRCYU1Bc3hDVEFIQmdOVgpCQW9UQURBcU1BVUdBeXRsY0FNaEFKVWVXditoejJ2S2RZbDhRbUdMdlBOc0M1WU14Wm9hdEl1VkovMENyMjg3Cm8wSXdRREFPQmdOVkhROEJBZjhFQkFNQ0I0QXdIUVlEVlIwbEJCWXdGQVlJS3dZQkJRVUhBd0VHQ0NzR0FRVUYKQndNQ01BOEdBMVVkRVFRSU1BYUhCSDhBQUFFd0JRWURLMlZ3QTBFQUtmSnArcjlZdEprNHhBZm1NUFcrUmtlTwpuNGFiRDJjVUhJeEVDR2FIMlFOY1RNaDR6WHhOWnQzZi9WTEhac3lRL1dMaU0ySmdsdXIyM0d5QUJma3dDdz09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
          key: LS0tLS1CRUdJTiBFRDI1NTE5IFBSSVZBVEUgS0VZLS0tLS0KTUM0Q0FRQXdCUVlESzJWd0JDSUVJRkNTUUwzZFlGY05IaDR6c2FEUWxXb3cwMldkME92REpBeGN0RzIvWXFCRwotLS0tLUVORCBFRDI1NTE5IFBSSVZBVEUgS0VZLS0tLS0K

First of all, target has been deprecated for a long time.

Second, a modern config contains endpoints, e.g.:

contexts:
    sfyra:
        endpoints:
            - 172.24.0.2
        ca: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJQakNCOGFBREFnRUNBaEJ6dXgzQS9XSkNTLzViclI5N1E5WXVNQVVHQXl0bGNEQVFNUTR3REFZRFZRUUsKRXdWMFlXeHZjekFlRncweU1EQTVNVGd4TmpBeU5UQmFGdzB6TURBNU1UWXhOakF5TlRCYU1CQXhEakFNQmdOVgpCQW9UQlhSaGJHOXpNQ293QlFZREsyVndBeUVBQ2dMaU5HSzRPVnhBeVpPSmJUMDJNTmpENXNxTlJ1QitlODdHCk5Ca3ZOaitqWVRCZk1BNEdBMVVkRHdFQi93UUVBd0lDaERBZEJnTlZIU1VFRmpBVUJnZ3JCZ0VGQlFjREFRWUkKS3dZQkJRVUhBd0l3RHdZRFZSMFRBUUgvQkFVd0F3RUIvekFkQmdOVkhRNEVGZ1FVdnZqSnUrMlhrQzBYZk5PWgp3aER6OTRKRU9BZ3dCUVlESzJWd0EwRUFBYjlJQTR6b0pIZ0RXcDMvZklLbWdacmV6elMzSGFvcFhlYWw1VVAvCmpkVzMrRGpKb1dKbWtzSkdzeE1TbmlxMUNwT1A1YmpzZHgrQnJDQ0V5eUhXQVE9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
        crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJQRENCNzZBREFnRUNBaEVBckkyYmJ2TnlyU2tjaDNRRVg4OHVSakFGQmdNclpYQXdFREVPTUF3R0ExVUUKQ2hNRmRHRnNiM013SGhjTk1qQXdPVEU0TVRZd01qVXdXaGNOTXpBd09URTJNVFl3TWpVd1dqQUxNUWt3QndZRApWUVFLRXdBd0tqQUZCZ01yWlhBRElRQzF1WU5pOHBzYXBFcWpCRzhXbHZOMzhFM3U1YlBRMTlEaWhIdUo5RVhyCllLTmpNR0V3RGdZRFZSMFBBUUgvQkFRREFnZUFNQjBHQTFVZEpRUVdNQlFHQ0NzR0FRVUZCd01CQmdnckJnRUYKQlFjREFqQWZCZ05WSFNNRUdEQVdnQlMrK01tNzdaZVFMUmQ4MDVuQ0VQUDNna1E0Q0RBUEJnTlZIUkVFQ0RBRwpod1IvQUFBQk1BVUdBeXRsY0FOQkFNOWg1U1hFSzNvV0NlYzhhbWt2WXlNWHFBaCt5RXVWbzBhaStJak9yeDhwCjA0QlNRcFlYbXM1MlZpWHNqdS9TM01IWnZjZHNrSnFPNTBqM3R0a2xoZ009Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
        key: LS0tLS1CRUdJTiBFRDI1NTE5IFBSSVZBVEUgS0VZLS0tLS0KTUM0Q0FRQXdCUVlESzJWd0JDSUVJSjZBTmNXait2c0EyYXFQTTFDNjQ5YVlHTDFjQ2NMZ1RpdUJCTGhBNEVUbgotLS0tLUVORCBFRDI1NTE5IFBSSVZBVEUgS0VZLS0tLS0K

Without endpoints, the config is not directly usable.

Provide configPatches on serverclasses/environments/clusters

At the moment, config patching is available only on Servers. For some of the patches, it might make more sense to have them available in other places (a sketch of the ServerClass variant follows the list):

  • installDisk might be common for all servers in a class
  • installImage should be the same for the whole cluster
  • extraKernelArgs might be common across the environment (e.g. enabling serial console)
  • etc.
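A sketch of one option, adding the same configPatches field that Server already has to the ServerClass spec; the ConfigPatch shape here is simplified (the real entries are JSON patches):

package v1alpha1

// ConfigPatch is a simplified stand-in for Sidero's JSON-patch entries
// (op/path/value) used by Server.spec.configPatches today.
type ConfigPatch struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value,omitempty"`
}

// ServerClassSpec sketch: patches defined here would be applied to every
// server matched by the class, e.g. a common installDisk or extraKernelArgs
// for enabling a serial console.
type ServerClassSpec struct {
	ConfigPatches []ConfigPatch `json:"configPatches,omitempty"`
	// ...existing qualifiers, etc...
}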

discovery interaction should have a timeout

[    2.790890] Run /init as init process
[    2.792940] [sidero] Reading SMBIOS
[    2.793537] [sidero] Creating resource via "192.168.1.203:50100"

The node is stuck like this forever.
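A self-contained sketch of bounding and retrying the registration instead of blocking forever; register here is a stand-in for the agent's real gRPC call:

package main

import (
	"context"
	"log"
	"time"
)

// register stands in for the agent's "Creating resource via ..." call to
// the Sidero API; the real implementation issues a gRPC request.
func register(ctx context.Context, endpoint string) error {
	// ...gRPC create-server request goes here...
	return nil
}

func main() {
	const endpoint = "192.168.1.203:50100"

	// Bound every attempt with a timeout and retry on failure so a transient
	// network problem does not leave the node stuck at boot forever.
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		err := register(ctx, endpoint)
		cancel()

		if err == nil {
			break
		}

		log.Printf("[sidero] registration failed, retrying: %v", err)
		time.Sleep(5 * time.Second)
	}
}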

Cluster deletion hangs

An attempt to kubectl delete cluster xyz hangs, at least when the cluster is not fully provisioned yet, so there is no way to clean up after a failed cluster install.

Improve "in-use" checking for metadata

In the metadata server, we comb through all MetalMachines and look for the server UUID that requested metadata. We encountered a case where an orphaned cluster had been created in another namespace; its MetalMachine still had a reference to the server UUID and was encountered first, so the metadata server failed to look up the machine. Some logs:

# kubectl -n sidero-system logs sidero-metadata-server-c5d8989b5-vmq6q -f
2020/09/11 14:52:55 received metadata request for uuid: xxx
W0911 14:52:55.289764       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020/09/11 14:52:55 failed fetching machine management-plane-cp-6c7mg in namespace default as owner of management-plane-cp-f4b44 for server xxx

Improve Metadata Server Logging

We should provide better clarity in the pod logs about where exactly we're failing when delivering metadata. Currently the logs only look like:

2020/09/11 14:20:06 received metadata request for uuid: 36353536-3135-5a43-3333-343956564a4e
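As a sketch, a failure log line could carry the namespace, MetalMachine, and owner Machine involved; all names below are illustrative:

package metadata

import "log"

// logLookupFailure sketches the extra context the metadata server could
// emit when it cannot resolve a server UUID to a machine config.
func logLookupFailure(uuid, namespace, metalMachine, owner string, err error) {
	log.Printf("metadata lookup failed for server %s: metalmachine %s/%s (owner machine %q): %v",
		uuid, namespace, metalMachine, owner, err)
}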

Provide IP address in Server's status

We could grab it when the server reaches out to the metadata server for its config (a sketch of the capture side follows the list).

Why do we need it?

  • server identification (easier than a UUID)
  • the load balancer for the control plane needs to know the IP addresses of the master nodes
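A sketch of the capture side, assuming the metadata server is a plain net/http handler; recordServerAddress is a hypothetical helper that would patch the Server status via the Kubernetes API:

package metadata

import (
	"log"
	"net"
	"net/http"
)

// recordServerAddress is a hypothetical helper that would patch the
// caller's address into the Server status for the given UUID.
func recordServerAddress(uuid, address string) error {
	// ...patch Server status via the Kubernetes API...
	return nil
}

// handler captures the caller's IP while serving its machine config.
func handler(w http.ResponseWriter, r *http.Request) {
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		host = r.RemoteAddr
	}

	uuid := r.URL.Query().Get("uuid")

	if err := recordServerAddress(uuid, host); err != nil {
		log.Printf("failed to record address %s for server %s: %v", host, uuid, err)
	}

	// ...existing metadata lookup and response...
}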

Support other environment resources

Looks like environments currently have a hardcoded "default" that we support. Maybe we should expose an environment ref in the Server and ServerClass types.
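A sketch of what the reference could look like on the Server spec (the same field would make sense on ServerClass); the field name is illustrative:

package v1alpha1

import corev1 "k8s.io/api/core/v1"

// ServerSpec sketch: when EnvironmentRef is nil, the current "default"
// Environment keeps being used, preserving today's behaviour.
type ServerSpec struct {
	EnvironmentRef *corev1.ObjectReference `json:"environmentRef,omitempty"`
	// ...existing spec fields...
}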

Change namespace to caps

Looks like we need to align the namespace to the CAPI norms. Currently the namespace is sidero-system.

Scaling down a machinedeployment doesn't work

When scaling down a machinedeployment, I still see a metalmachine:

[N] kubectl get metalmachines -owide
NAME                          READY   CLUSTER              MACHINE                       SERVER
management-cluster-cp-blhg7   true    management-cluster   management-cluster-cp-qvnjq   28db1f01-7ab8-44a6-a15b-48fe2326151a

Configuration controller

With the new ApplyConfiguration API in Talos, we could provide controller-like semantics around config management.

Guide for post-pivot

There are some things that need to be done after pivoting. The first that comes to mind is editing the default environment to update the $PUBLIC_IP. We should take note of these and add them to the "creating your first cluster" guide.

Dynamic metadata endpoint

We might be able to detect that the metadata service is exposed on some node via the host network. By default, we could return the metadata service IP, which would cover the BGP case. Then, if we detect that the service is exposed on a node port, we can return that instead. This would make environments much easier to manage.
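A hedged sketch of that detection using client-go; the service name and namespace are taken from the logs above, but they and the overall shape are assumptions:

package metadata

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// endpoint returns the address nodes should use to reach the metadata
// service: a node IP plus NodePort when the service is exposed that way,
// otherwise the service IP (covering the BGP / LoadBalancer case).
func endpoint(ctx context.Context, c kubernetes.Interface) (string, error) {
	svc, err := c.CoreV1().Services("sidero-system").Get(ctx, "sidero-metadata-server", metav1.GetOptions{})
	if err != nil {
		return "", err
	}

	if len(svc.Spec.Ports) == 0 {
		return "", fmt.Errorf("metadata service has no ports")
	}

	if svc.Spec.Type == corev1.ServiceTypeNodePort {
		nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err == nil && len(nodes.Items) > 0 {
			for _, addr := range nodes.Items[0].Status.Addresses {
				if addr.Type == corev1.NodeInternalIP {
					return fmt.Sprintf("%s:%d", addr.Address, svc.Spec.Ports[0].NodePort), nil
				}
			}
		}
	}

	return fmt.Sprintf("%s:%d", svc.Spec.ClusterIP, svc.Spec.Ports[0].Port), nil
}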
