
sidero's Introduction

Sidero

Caution

Sidero Labs is no longer actively developing Sidero Metal. For an alternative, please see Omni. Unless you have an existing support contract covering Sidero Metal, all support will be provided by the community (including questions in our Slack workspace).

Kubernetes Bare Metal Lifecycle Management. Sidero Metal provides lightweight, composable tools that can be used to create bare-metal Talos + Kubernetes clusters. Sidero Metal is an open-source project from Sidero Labs.

Documentation

Visit the project site.

Compatibility with Cluster API and Kubernetes Versions

This provider's versions are compatible with the following versions of Cluster API:

v1alpha3 (v0.3), v1alpha4 (v0.4), v1beta1 (v1.x)

  • Sidero Provider (v0.5)
  • Sidero Provider (v0.6)

This provider's versions are able to install and manage the following versions of Kubernetes:

v1.19, v1.20, v1.21, v1.22, v1.23, v1.24, v1.25, v1.26, v1.27, v1.28, v1.29, v1.30

  • Sidero Provider (v0.5)
  • Sidero Provider (v0.6)

This provider's versions are compatible with the following versions of Talos:

v0.12, v0.13, v0.14, v1.0, v1.1, v1.2, v1.3, v1.4, v1.5, v1.6, v1.7

  • Sidero Provider (v0.5): ✓ (+) ✓ (+)
  • Sidero Provider (v0.6)

Support

Join our Slack!

sidero's People

Contributors

aarnaud, aleksi, andrewrynhard, bzub, charlie-haley, frezbo, hxtmdev, jjgadgets, khuedoan, ksawio97, kubebn, linuxmaniac, lion7, lukecarrier, martin-sweeny, nathandotleeathpe, ogkevin, oscr, rmvangun, rsmitty, rtrox, sflobbe, smira, steverfrancis, timjones, ulexus, unix4ever, utkuozdemir, vorburger, zbernstein68


sidero's Issues

bug: discovery image hangs on error

See the attached log:

[    3.537285] [sidero] Using "172.24.0.2:50100" as API endpoint
[    3.538304] [sidero] Reading SMBIOS
[   33.539339] [sidero] context deadline exceeded
[   33.540829] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[   33.542864] CPU: 1 PID: 1727 Comm: init Tainted: G        W         5.5.15-talos #1
[   33.544855] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
[   33.547043] Call Trace:
[   33.547859]  dump_stack+0x66/0x90
[   33.548829]  panic+0xfc/0x2b1
[   33.549710]  do_exit.cold.25+0x14/0xbb
[   33.550761]  do_group_exit+0x35/0x90
[   33.551785]  get_signal+0x134/0x810
[   33.552789]  do_signal+0x2b/0x590
[   33.553754]  ? __x64_sys_futex+0x137/0x174
[   33.554862]  ? __set_current_blocked+0x31/0x50
[   33.555955]  ? sigprocmask+0x6d/0x90
[   33.556881]  do_syscall_64+0x17a/0x3da
[   33.557874]  ? __do_page_fault+0x269/0x4a0
[   33.558908]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   33.560111] RIP: 0033:0x46b8a3
[   33.560844] Code: 24 20 c3 cc cc cc cc 48 8b 7c 24 08 8b 74 24 10 8b 54 24 14 4c 8b 54 24 18 4c 8b 44 24 20 44 8b 4c 24 28 b8 ca 00 00 00 0f 05 <89> 44 24 c
[   33.563201] RSP: 002b:000000c0000fbda8 EFLAGS: 00000286 ORIG_RAX: 00000000000000ca
[   33.564286] RAX: fffffffffffffe00 RBX: 000000c00013a000 RCX: 000000000046b8a3
[   33.565200] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000000c00013a148
[   33.566169] RBP: 000000c0000fbdf0 R08: 0000000000000000 R09: 0000000000000000
[   33.567191] R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000003
[   33.568218] R13: 000000c0000ab500 R14: 00000000008a30ec R15: 0000000000000000
[   33.569235] Kernel Offset: 0x7600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   33.570712] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100 ]---

Not sure why it can't talk back, but the panic is weird, and we should reboot on panic automatically (e.g. by setting the panic= kernel argument so the node restarts after a kernel panic instead of hanging).

Use "host" instead of "server"

In the next version of Sidero (v0.2), we should rename "Server" to "Host" (still don't love it, and open to alternatives).

add test for sidero upgrades

We should add a test for upgrading from the last Sidero release to the Sidero built in the PR, to make sure we're not breaking upgrades.

machine Spec doesn't have any IP information

Sidero:

status:
    bootstrapReady: true
    conditions:
    - lastTransitionTime: "2020-09-18T16:13:09Z"
      status: "True"
      type: Ready
    - lastTransitionTime: "2020-09-18T16:08:15Z"
      status: "True"
      type: BootstrapReady
    - lastTransitionTime: "2020-09-18T16:13:09Z"
      status: "True"
      type: InfrastructureReady
    infrastructureReady: true
    lastUpdated: "2020-09-18T16:13:09Z"
    nodeRef:
      apiVersion: v1
      kind: Node
      name: pxe-2
      uid: e6c0e820-2287-4e8d-95a1-76c1e3a541de
    observedGeneration: 3
    phase: Running

while for AWS/GCP we have .status.addresses (a sketch of publishing addresses from the Sidero provider follows the example):

MASTER_IP=$(${KUBECTL} --kubeconfig /tmp/e2e/docker/kubeconfig get machine -o go-template --template='{{range .status.addresses}}{{if eq .type "ExternalIP"}}{{.address}}{{end}}{{end}}' ${FIRST_CP_NODE})
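A hedged sketch of how the Sidero provider could publish the address, assuming cluster-api's v1beta1 MachineAddress types and a hypothetical Addresses field on the MetalMachine status; per the CAPI contract, the value would then be copied into Machine.status.addresses:

package controllers

import clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"

// metalMachineStatus is a stand-in containing only the proposed field;
// the real MetalMachineStatus would gain an equivalent Addresses field.
type metalMachineStatus struct {
	Addresses []clusterv1.MachineAddress
}

// setAddresses records the server's IP on the MetalMachine status so it
// surfaces on the CAPI Machine, matching what the AWS/GCP providers expose.
func setAddresses(status *metalMachineStatus, serverIP string) {
	status.Addresses = []clusterv1.MachineAddress{
		{Type: clusterv1.MachineInternalIP, Address: serverIP},
	}
}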

Nodes start booting image when environment is created

As soon as the default environment is created, nodes start booting the kernel/initramfs from that environment, which is not expected to happen.

With Talos, luckily, nodes can't fetch a config from the metadata server, but it might lead to other consequences for different server types.

Locking for picking servers

I think some customers saw a case where a server got picked twice for two different CAPI machines. We should put a basic mutex around the selection code block to ensure this doesn't happen, since our infra provider runs 10 worker threads by default.
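A minimal, self-contained sketch of the idea; the Server type and in-use flag here are simplified stand-ins for the real resources:

package selection

import (
	"errors"
	"sync"
)

// Server is a stand-in for Sidero's Server resource with only the fields
// the sketch needs.
type Server struct {
	Name  string
	InUse bool
}

// mu guards the check-then-claim critical section across the provider's
// worker goroutines (10 by default).
var mu sync.Mutex

// PickServer returns the first free server and marks it in use before the
// lock is released, so two CAPI machines can never be matched to the same
// server.
func PickServer(servers []*Server) (*Server, error) {
	mu.Lock()
	defer mu.Unlock()

	for _, s := range servers {
		if !s.InUse {
			s.InUse = true
			return s, nil
		}
	}

	return nil, errors.New("no matching server available")
}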

Add phase status to servers

We could let users know a little more about the state of a server by adding a PHASE field (a sketch of the API follows the list):

  • Cleaning
  • PoweredOn
  • PoweredOff
  • Failed/Failing
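A sketch of what the API addition could look like; the type and constant names are illustrative, not the shipped API:

package v1alpha1

// ServerPhase describes the lifecycle phase of a Server.
type ServerPhase string

const (
	ServerPhaseCleaning   ServerPhase = "Cleaning"
	ServerPhasePoweredOn  ServerPhase = "PoweredOn"
	ServerPhasePoweredOff ServerPhase = "PoweredOff"
	ServerPhaseFailed     ServerPhase = "Failed"
)

// ServerStatus would gain a Phase field that the controllers keep up to
// date and that kubectl can surface as an extra printer column.
type ServerStatus struct {
	Phase ServerPhase `json:"phase,omitempty"`
	// ...existing status fields...
}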

test: deprovisioning

This requires support from Sidero.

We have indirect tests, but we probably want to verify that the disk actually gets wiped, etc.

Ensure deletion works properly

We've seen cases where deleting the cluster hangs indefinitely. We suspect we're missing some proper owner references here.

Better handling of errors and reconciliation

We don't need to return the errors encountered in CAPI providers. Instead, we should just log the error and return ctrl.Result{}, nil: if the resource status is not ready, another reconcile gets triggered anyway. This pattern exists in all of our providers, so we'll need to clean it up everywhere.
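A minimal sketch of the proposed pattern, assuming a controller-runtime reconciler; the Reconciler type and the reconcileResource helper are stand-ins:

package controllers

import (
	"context"

	"github.com/go-logr/logr"
	ctrl "sigs.k8s.io/controller-runtime"
)

// Reconciler stands in for any of the provider's reconcilers.
type Reconciler struct {
	Log logr.Logger
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := r.reconcileResource(ctx, req); err != nil {
		// Log instead of returning the error: if the resource status is not
		// ready, another reconcile is triggered anyway, and we avoid the
		// noisy error events and exponential backoff from controller-runtime.
		r.Log.Error(err, "reconcile failed", "request", req.NamespacedName)

		return ctrl.Result{}, nil
	}

	return ctrl.Result{}, nil
}

// reconcileResource holds the actual reconciliation logic.
func (r *Reconciler) reconcileResource(ctx context.Context, req ctrl.Request) error {
	return nil
}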

generated talosconfig doesn't contain endpoints

Not sure if this should be an issue for the bootstrap controller project, but I'll report it here for easier access.

What I see in the talosconfig resource:

  status:
    dataSecretName: management-cluster-cp-46k74-bootstrap-data
    ready: true
    talosConfig: |
      context: management-cluster
      contexts:
        management-cluster:
          target: ""
          ca: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJIekNCMHFBREFnRUNBaEIxemJ3OTliT3pPT0U4OUFLMU5tVkFNQVVHQXl0bGNEQVFNUTR3REFZRFZRUUsKRXdWMFlXeHZjekFlRncweU1EQTVNVGd4TmpBNE1UQmFGdzB6TURBNU1UWXhOakE0TVRCYU1CQXhEakFNQmdOVgpCQW9UQlhSaGJHOXpNQ293QlFZREsyVndBeUVBR3J5VmU1SGVsVU11MTRwR2FFZ0kyVzJEdGRyV0E0d0dla2NGCjBsbkNOOVdqUWpCQU1BNEdBMVVkRHdFQi93UUVBd0lDaERBZEJnTlZIU1VFRmpBVUJnZ3JCZ0VGQlFjREFRWUkKS3dZQkJRVUhBd0l3RHdZRFZSMFRBUUgvQkFVd0F3RUIvekFGQmdNclpYQURRUUQzYlkzMkVqSnprUW9kWTZqVAo2Wlg5TXRubkdrM2dnMS9GaVd4cmEvNU96R1RnQXdhK3VzS1ZybEdpaDFjSnZzd3VxWHZyZVRScHh4N0xnU2hqClBCUUQKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
          crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJHakNCemFBREFnRUNBaEF4WkJCL2orbEVzYkR6V1FINTd2RXNNQVVHQXl0bGNEQVFNUTR3REFZRFZRUUsKRXdWMFlXeHZjekFlRncweU1EQTVNVGd4TmpBNE1UQmFGdzB6TURBNU1UWXhOakE0TVRCYU1Bc3hDVEFIQmdOVgpCQW9UQURBcU1BVUdBeXRsY0FNaEFKVWVXditoejJ2S2RZbDhRbUdMdlBOc0M1WU14Wm9hdEl1VkovMENyMjg3Cm8wSXdRREFPQmdOVkhROEJBZjhFQkFNQ0I0QXdIUVlEVlIwbEJCWXdGQVlJS3dZQkJRVUhBd0VHQ0NzR0FRVUYKQndNQ01BOEdBMVVkRVFRSU1BYUhCSDhBQUFFd0JRWURLMlZ3QTBFQUtmSnArcjlZdEprNHhBZm1NUFcrUmtlTwpuNGFiRDJjVUhJeEVDR2FIMlFOY1RNaDR6WHhOWnQzZi9WTEhac3lRL1dMaU0ySmdsdXIyM0d5QUJma3dDdz09Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
          key: LS0tLS1CRUdJTiBFRDI1NTE5IFBSSVZBVEUgS0VZLS0tLS0KTUM0Q0FRQXdCUVlESzJWd0JDSUVJRkNTUUwzZFlGY05IaDR6c2FEUWxXb3cwMldkME92REpBeGN0RzIvWXFCRwotLS0tLUVORCBFRDI1NTE5IFBSSVZBVEUgS0VZLS0tLS0K

First of all, target has been deprecated for a long time.

Second, a modern config contains endpoints, e.g.:

contexts:
    sfyra:
        endpoints:
            - 172.24.0.2
        ca: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJQakNCOGFBREFnRUNBaEJ6dXgzQS9XSkNTLzViclI5N1E5WXVNQVVHQXl0bGNEQVFNUTR3REFZRFZRUUsKRXdWMFlXeHZjekFlRncweU1EQTVNVGd4TmpBeU5UQmFGdzB6TURBNU1UWXhOakF5TlRCYU1CQXhEakFNQmdOVgpCQW9UQlhSaGJHOXpNQ293QlFZREsyVndBeUVBQ2dMaU5HSzRPVnhBeVpPSmJUMDJNTmpENXNxTlJ1QitlODdHCk5Ca3ZOaitqWVRCZk1BNEdBMVVkRHdFQi93UUVBd0lDaERBZEJnTlZIU1VFRmpBVUJnZ3JCZ0VGQlFjREFRWUkKS3dZQkJRVUhBd0l3RHdZRFZSMFRBUUgvQkFVd0F3RUIvekFkQmdOVkhRNEVGZ1FVdnZqSnUrMlhrQzBYZk5PWgp3aER6OTRKRU9BZ3dCUVlESzJWd0EwRUFBYjlJQTR6b0pIZ0RXcDMvZklLbWdacmV6elMzSGFvcFhlYWw1VVAvCmpkVzMrRGpKb1dKbWtzSkdzeE1TbmlxMUNwT1A1YmpzZHgrQnJDQ0V5eUhXQVE9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
        crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJQRENCNzZBREFnRUNBaEVBckkyYmJ2TnlyU2tjaDNRRVg4OHVSakFGQmdNclpYQXdFREVPTUF3R0ExVUUKQ2hNRmRHRnNiM013SGhjTk1qQXdPVEU0TVRZd01qVXdXaGNOTXpBd09URTJNVFl3TWpVd1dqQUxNUWt3QndZRApWUVFLRXdBd0tqQUZCZ01yWlhBRElRQzF1WU5pOHBzYXBFcWpCRzhXbHZOMzhFM3U1YlBRMTlEaWhIdUo5RVhyCllLTmpNR0V3RGdZRFZSMFBBUUgvQkFRREFnZUFNQjBHQTFVZEpRUVdNQlFHQ0NzR0FRVUZCd01CQmdnckJnRUYKQlFjREFqQWZCZ05WSFNNRUdEQVdnQlMrK01tNzdaZVFMUmQ4MDVuQ0VQUDNna1E0Q0RBUEJnTlZIUkVFQ0RBRwpod1IvQUFBQk1BVUdBeXRsY0FOQkFNOWg1U1hFSzNvV0NlYzhhbWt2WXlNWHFBaCt5RXVWbzBhaStJak9yeDhwCjA0QlNRcFlYbXM1MlZpWHNqdS9TM01IWnZjZHNrSnFPNTBqM3R0a2xoZ009Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
        key: LS0tLS1CRUdJTiBFRDI1NTE5IFBSSVZBVEUgS0VZLS0tLS0KTUM0Q0FRQXdCUVlESzJWd0JDSUVJSjZBTmNXait2c0EyYXFQTTFDNjQ5YVlHTDFjQ2NMZ1RpdUJCTGhBNEVUbgotLS0tLUVORCBFRDI1NTE5IFBSSVZBVEUgS0VZLS0tLS0K

Without endpoints, the config is not directly usable.

Provide configPatches on serverclasses/environments/clusters

At the moment, config patching is available only on Servers. For some of the patches, it might make more sense to have them available in other places (a sketch of the ServerClass variant follows the list):

  • installDisk might be common for all servers in a class
  • installImage should be the same for the whole cluster
  • extraKernelArgs might be common across the environment (e.g. enabling serial console)
  • etc.
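A sketch of one option, adding the same configPatches field that Server already has to the ServerClass spec; the ConfigPatch shape here is simplified (the real entries are JSON patches):

package v1alpha1

// ConfigPatch is a simplified stand-in for Sidero's JSON-patch entries
// (op/path/value) used by Server.spec.configPatches today.
type ConfigPatch struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value,omitempty"`
}

// ServerClassSpec sketch: patches defined here would be applied to every
// server matched by the class, e.g. a common installDisk or extraKernelArgs
// for enabling a serial console.
type ServerClassSpec struct {
	ConfigPatches []ConfigPatch `json:"configPatches,omitempty"`
	// ...existing qualifiers, etc...
}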

discovery interaction should have a timeout

[    2.790890] Run /init as init process
[    2.792940] [sidero] Reading SMBIOS
[    2.793537] [sidero] Creating resource via "192.168.1.203:50100"

The node is stuck like this forever.
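A self-contained sketch of bounding and retrying the registration instead of blocking forever; register here is a stand-in for the agent's real gRPC call:

package main

import (
	"context"
	"log"
	"time"
)

// register stands in for the agent's "Creating resource via ..." call to
// the Sidero API; the real implementation issues a gRPC request.
func register(ctx context.Context, endpoint string) error {
	// ...gRPC create-server request goes here...
	return nil
}

func main() {
	const endpoint = "192.168.1.203:50100"

	// Bound every attempt with a timeout and retry on failure so a transient
	// network problem does not leave the node stuck at boot forever.
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		err := register(ctx, endpoint)
		cancel()

		if err == nil {
			break
		}

		log.Printf("[sidero] registration failed, retrying: %v", err)
		time.Sleep(5 * time.Second)
	}
}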

Cluster deletion hangs

An attempt to kubectl delete cluster xyz hangs, at least when the cluster is not fully provisioned yet, so there is no way to clean up after a failed cluster install.

Improve "in-use" checking for metadata

In the metadata server, we comb through all MetalMachines and look for the server UUID that requested metadata. We encountered a case where an orphaned cluster had been created in another namespace; its MetalMachine still had a reference to the server UUID and was encountered first, so the metadata server failed to look up the machine. Some logs:

# kubectl -n sidero-system logs sidero-metadata-server-c5d8989b5-vmq6q -f
2020/09/11 14:52:55 received metadata request for uuid: xxx
W0911 14:52:55.289764       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020/09/11 14:52:55 failed fetching machine management-plane-cp-6c7mg in namespace default as owner of management-plane-cp-f4b44 for server xxx

Improve Metadata Server Logging

We should provide better clarity in the pod logs about where exactly we're failing when delivering metadata. Currently the logs only look like:

2020/09/11 14:20:06 received metadata request for uuid: 36353536-3135-5a43-3333-343956564a4e
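As a sketch, a failure log line could carry the namespace, MetalMachine, and owner Machine involved; all names below are illustrative:

package metadata

import "log"

// logLookupFailure sketches the extra context the metadata server could
// emit when it cannot resolve a server UUID to a machine config.
func logLookupFailure(uuid, namespace, metalMachine, owner string, err error) {
	log.Printf("metadata lookup failed for server %s: metalmachine %s/%s (owner machine %q): %v",
		uuid, namespace, metalMachine, owner, err)
}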

Provide IP address in Server's status

We could grab it when the server reaches out to the metadata server for its config (a sketch of the capture side follows the list).

Why do we need it?

  • server identification (easier than a UUID)
  • the load balancer for the control plane needs to know the IP addresses of the master nodes
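A sketch of the capture side, assuming the metadata server is a plain net/http handler; recordServerAddress is a hypothetical helper that would patch the Server status via the Kubernetes API:

package metadata

import (
	"log"
	"net"
	"net/http"
)

// recordServerAddress is a hypothetical helper that would patch the
// caller's address into the Server status for the given UUID.
func recordServerAddress(uuid, address string) error {
	// ...patch Server status via the Kubernetes API...
	return nil
}

// handler captures the caller's IP while serving its machine config.
func handler(w http.ResponseWriter, r *http.Request) {
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		host = r.RemoteAddr
	}

	uuid := r.URL.Query().Get("uuid")

	if err := recordServerAddress(uuid, host); err != nil {
		log.Printf("failed to record address %s for server %s: %v", host, uuid, err)
	}

	// ...existing metadata lookup and response...
}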

Support other environment resources

Looks like environments currently have a hardcoded "default" that we support. Maybe we should expose an environment ref in the Server and ServerClass types.
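A sketch of what the reference could look like on the Server spec (the same field would make sense on ServerClass); the field name is illustrative:

package v1alpha1

import corev1 "k8s.io/api/core/v1"

// ServerSpec sketch: when EnvironmentRef is nil, the current "default"
// Environment keeps being used, preserving today's behaviour.
type ServerSpec struct {
	EnvironmentRef *corev1.ObjectReference `json:"environmentRef,omitempty"`
	// ...existing spec fields...
}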

Change namespace to caps

Looks like we need to align the namespace to the CAPI norms. Currently the namespace is sidero-system.

Scaling down a machinedeployment doesn't work

When scaling down a machinedeployment, I still see a metalmachine:

[N] kubectl get metalmachines -owide
NAME                          READY   CLUSTER              MACHINE                       SERVER
management-cluster-cp-blhg7   true    management-cluster   management-cluster-cp-qvnjq   28db1f01-7ab8-44a6-a15b-48fe2326151a

Configuration controller

With the new ApplyConfiguration API in Talos, we could provide controller-like semantics around config management.

Guide for post-pivot

There are some things that need to be done after pivoting. The first that comes to mind is editing the default environment to update the $PUBLIC_IP. We should take note of these and add them to the "creating your first cluster" guide.

Dynamic metadata endpoint

We might be able to detect that the metadata service is exposed on some node via the host network. By default, we could return the metadata service IP, which would cover the BGP case. Then, if we detect that the service is exposed on a node port, we can return that instead. This would make environments much easier to manage.
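A hedged sketch of that detection using client-go; the service name and namespace are taken from the logs above, but they and the overall shape are assumptions:

package metadata

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// endpoint returns the address nodes should use to reach the metadata
// service: a node IP plus NodePort when the service is exposed that way,
// otherwise the service IP (covering the BGP / LoadBalancer case).
func endpoint(ctx context.Context, c kubernetes.Interface) (string, error) {
	svc, err := c.CoreV1().Services("sidero-system").Get(ctx, "sidero-metadata-server", metav1.GetOptions{})
	if err != nil {
		return "", err
	}

	if len(svc.Spec.Ports) == 0 {
		return "", fmt.Errorf("metadata service has no ports")
	}

	if svc.Spec.Type == corev1.ServiceTypeNodePort {
		nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err == nil && len(nodes.Items) > 0 {
			for _, addr := range nodes.Items[0].Status.Addresses {
				if addr.Type == corev1.NodeInternalIP {
					return fmt.Sprintf("%s:%d", addr.Address, svc.Spec.Ports[0].NodePort), nil
				}
			}
		}
	}

	return fmt.Sprintf("%s:%d", svc.Spec.ClusterIP, svc.Spec.Ports[0].Port), nil
}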
