
Comments (24)

janiskemper avatar janiskemper commented on June 18, 2024 1

I'm not sure what happens when you move the cluster. @batistein should answer then

from cluster-api-provider-hetzner.

janiskemper avatar janiskemper commented on June 18, 2024 1

Okay, please watch the host object, especially at the moment when it gets stuck in provisioning and sends all the reset requests.
The logs in the CAPH controller that concern this specific host object would also be interesting :)
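
For example, something like this (the host name bm-0 and the controller namespace/deployment name are placeholders, adjust them to your setup):

kubectl get hetznerbaremetalhost -A -w
kubectl logs -n caph-system deployment/caph-controller-manager | grep bm-0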

from cluster-api-provider-hetzner.

Privatecoder avatar Privatecoder commented on June 18, 2024

@radimk check this

from cluster-api-provider-hetzner.

radimk avatar radimk commented on June 18, 2024

@Privatecoder: thanks. I think my problem is different. At least I don't see any 'failed to set ssh key' errors. The machine is provisioned for a while, but it is unhealthy and doesn't start the process of joining the cluster. The fact that the call from caph -> load balancer -> control plane shows an error is suspicious, but maybe it is not the root cause and I need to look somewhere else.
After the one-hour timeout expires, the machine starts to reboot and bootstrap again under a new name.
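
As a rough check (the IP below is a placeholder for the load balancer address), one can test whether the API server answers through the load balancer at all; any HTTP response (even 401/403) means the path works, while EOF or a timeout points at the load balancer or its targets:

curl -vk https://<load-balancer-ip>:443/healthz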

from cluster-api-provider-hetzner.

janiskemper avatar janiskemper commented on June 18, 2024

@radimk can you check whether the Hetzner load balancer has the correct targets (including the bare metal servers)?
And then, can you please check the status of cloud-init on the server with cloud-init status and see whether there is something interesting in /var/log/cloud-init-output.log?
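
For example (the load balancer name is a placeholder):

hcloud load-balancer describe <lb-name>

# and on the server itself
cloud-init status --long
tail -n 100 /var/log/cloud-init-output.log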

from cluster-api-provider-hetzner.

radimk avatar radimk commented on June 18, 2024

@janiskemper the load balancer points only to the control-plane node. cloud-init status says status: done and the log file looks like this:

Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Thu, 03 Nov 2022 15:13:57 +0000. Up 5.22 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init' at Thu, 03 Nov 2022 15:14:14 +0000. Up 22.65 seconds.
ci-info: +++++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++++
ci-info: +-----------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: |   Device  |  Up  |           Address           |       Mask      | Scope  |     Hw-Address    |
ci-info: +-----------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: | enp0s31f6 | True |         46.4.66.173         | 255.255.255.255 | global | 90:1b:0e:cc:ce:73 |
ci-info: | enp0s31f6 | True |   2a01:4f8:140:2484::2/64   |        .        | global | 90:1b:0e:cc:ce:73 |
ci-info: | enp0s31f6 | True | fe80::921b:eff:fecc:ce73/64 |        .        |  link  | 90:1b:0e:cc:ce:73 |
ci-info: |     lo    | True |          127.0.0.1          |    255.0.0.0    |  host  |         .         |
ci-info: |     lo    | True |           ::1/128           |        .        |  host  |         .         |
ci-info: +-----------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: ++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++
ci-info: +-------+-------------+-------------+---------+-----------+-------+
ci-info: | Route | Destination |   Gateway   | Genmask | Interface | Flags |
ci-info: +-------+-------------+-------------+---------+-----------+-------+
ci-info: |   0   |   0.0.0.0   | 46.4.66.129 | 0.0.0.0 | enp0s31f6 |   UG  |
ci-info: +-------+-------------+-------------+---------+-----------+-------+
ci-info: ++++++++++++++++++++++++Route IPv6 info+++++++++++++++++++++++++
ci-info: +-------+------------------------+---------+-----------+-------+
ci-info: | Route |      Destination       | Gateway | Interface | Flags |
ci-info: +-------+------------------------+---------+-----------+-------+
ci-info: |   1   | 2a01:4f8:140:2484::/64 |    ::   | enp0s31f6 |   U   |
ci-info: |   2   |       fe80::/64        |    ::   | enp0s31f6 |   U   |
ci-info: |   3   |          ::/0          | fe80::1 | enp0s31f6 |   UG  |
ci-info: |   5   |         local          |    ::   | enp0s31f6 |   U   |
ci-info: |   6   |         local          |    ::   | enp0s31f6 |   U   |
ci-info: |   7   |       multicast        |    ::   | enp0s31f6 |   U   |
ci-info: +-------+------------------------+---------+-----------+-------+
2022-11-03 15:14:14,746 - schema.py[WARNING]: Invalid cloud-config provided: Please run 'sudo cloud-init schema --system' to see the schema errors.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:config' at Thu, 03 Nov 2022 15:14:15 +0000. Up 23.02 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:final' at Thu, 03 Nov 2022 15:14:15 +0000. Up 23.27 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 finished at Thu, 03 Nov 2022 15:14:15 +0000. Datasource DataSourceNone.  Up 23.35 seconds
2022-11-03 15:14:15,367 - cc_final_message.py[WARNING]: Used fallback datasource
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Thu, 03 Nov 2022 15:15:08 +0000. Up 5.02 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init' at Thu, 03 Nov 2022 15:15:26 +0000. Up 23.12 seconds.
ci-info: +++++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++++
ci-info: +-----------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: |   Device  |  Up  |           Address           |       Mask      | Scope  |     Hw-Address    |
ci-info: +-----------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: | enp0s31f6 | True |         46.4.66.173         | 255.255.255.255 | global | 90:1b:0e:cc:ce:73 |
ci-info: | enp0s31f6 | True |   2a01:4f8:140:2484::2/64   |        .        | global | 90:1b:0e:cc:ce:73 |
ci-info: | enp0s31f6 | True | fe80::921b:eff:fecc:ce73/64 |        .        |  link  | 90:1b:0e:cc:ce:73 |
ci-info: |     lo    | True |          127.0.0.1          |    255.0.0.0    |  host  |         .         |
ci-info: |     lo    | True |           ::1/128           |        .        |  host  |         .         |
ci-info: +-----------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: ++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++
ci-info: +-------+-------------+-------------+---------+-----------+-------+
ci-info: | Route | Destination |   Gateway   | Genmask | Interface | Flags |
ci-info: +-------+-------------+-------------+---------+-----------+-------+
ci-info: |   0   |   0.0.0.0   | 46.4.66.129 | 0.0.0.0 | enp0s31f6 |   UG  |
ci-info: +-------+-------------+-------------+---------+-----------+-------+
ci-info: ++++++++++++++++++++++++Route IPv6 info+++++++++++++++++++++++++
ci-info: +-------+------------------------+---------+-----------+-------+
ci-info: | Route |      Destination       | Gateway | Interface | Flags |
ci-info: +-------+------------------------+---------+-----------+-------+
ci-info: |   1   | 2a01:4f8:140:2484::/64 |    ::   | enp0s31f6 |   U   |
ci-info: |   2   |       fe80::/64        |    ::   | enp0s31f6 |   U   |
ci-info: |   3   |          ::/0          | fe80::1 | enp0s31f6 |   UG  |
ci-info: |   5   |         local          |    ::   | enp0s31f6 |   U   |
ci-info: |   6   |         local          |    ::   | enp0s31f6 |   U   |
ci-info: |   7   |       multicast        |    ::   | enp0s31f6 |   U   |
ci-info: +-------+------------------------+---------+-----------+-------+
2022-11-03 15:15:26,227 - schema.py[WARNING]: Invalid cloud-config provided: Please run 'sudo cloud-init schema --system' to see the schema errors.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:config' at Thu, 03 Nov 2022 15:15:26 +0000. Up 23.47 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:final' at Thu, 03 Nov 2022 15:15:26 +0000. Up 23.73 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 finished at Thu, 03 Nov 2022 15:15:26 +0000. Datasource DataSourceNone.  Up 23.81 seconds
2022-11-03 15:15:26,833 - cc_final_message.py[WARNING]: Used fallback datasource

Digging more into the given warning:

root@bm-g-work-md-1-kh22c ~ # cloud-init schema --system
Cloud config schema deprecations: 
Error:
Cloud config schema errors: users.0: {'gecos': 'ubuntu', 'groups': {'adm', 'lxd', 'plugdev', 'sudo', 'dip', 'cdrom'}, 'lock_passwd': False, 'name': 'ubuntu', 'passwd': '...', 'shell': '/bin/bash'} is not valid under any of the given schemas
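
To pinpoint the offending key, cloud-init can annotate the checked config in place (assuming --annotate is available in this cloud-init version; the user-data path is the usual default):

cloud-init schema --system --annotate
less /var/lib/cloud/instance/user-data.txt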

from cluster-api-provider-hetzner.

radimk avatar radimk commented on June 18, 2024

There are a few kinds of errors in the CAPH logs. The first is an error when accessing API group resources: the controller calls the load balancer, which always points to the HCloud control-plane node (not the HCloud node if I add one, and not the bare metal server). The second type shows an error failed to save host status after \"registering\": failed to update status: Operation cannot be fulfilled on hetznerbaremetalhosts.infrastructure.cluster.x-k8s.io \"bm-0\" or the same after "provisioning". Later there is also "error":"failed to reconcile HetznerBareMetalHost default/bm-0: action \"provisioning\" failed: failed to handle incomplete boot - installImage: unexpected hostname bm-g-work-md-1-pmkq7","errorVerbose":"unexpected hostname bm-g-work-md-1-pmkq7\nfailed to handle incomplete boot - installImage

{"level":"ERROR","time":"2022-11-03T18:39:56.237Z","file":"[email protected]/logr.go:279","message":"Failed to get API Group-Resources","error":"Get \"https://49.12.21.105:443/api?timeout=32s\": EOF","stacktrace":"github.com/go-logr/logr.Logger.Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/logr.go:279\nsigs.k8s.io/controller-runtime/pkg/cluster.New\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/cluster/cluster.go:160\nsigs.k8s.io/controller-runtime/pkg/manager.New\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/manager.go:322\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).newTargetClusterManager\n\t/workspace/controllers/hetznercluster_controller.go:582\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileTargetClusterManager\n\t/workspace/controllers/hetznercluster_controller.go:514\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileNormal\n\t/workspace/controllers/hetznercluster_controller.go:221\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).Reconcile\n\t/workspace/controllers/hetznercluster_controller.go:159\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"ERROR","time":"2022-11-03T18:39:56.238Z","file":"controller/controller.go:326","message":"Reconciler error","controller":"hetznercluster","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerCluster","hetznerCluster":{"name":"g-work","namespace":"default"},"namespace":"default","name":"g-work","reconcileID":"5f143185-9dd0-42a1-8be3-2dd0b8fc4c8c","error":"failed to reconcile target cluster manager: failed to create a clusterManager for HetznerCluster default/g-work: failed to setup guest cluster manager: Get \"https://49.12.21.105:443/api?timeout=32s\": EOF","errorVerbose":"Get \"https://49.12.21.105:443/api?timeout=32s\": EOF\nfailed to setup guest cluster manager\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).newTargetClusterManager\n\t/workspace/controllers/hetznercluster_controller.go:591\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileTargetClusterManager\n\t/workspace/controllers/hetznercluster_controller.go:514\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileNormal\n\t/workspace/controllers/hetznercluster_controller.go:221\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).Reconcile\n\t/workspace/controllers/hetznercluster_controller.go:159\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nfailed to create a clusterManager for HetznerCluster default/g-work\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileTargetClusterManager\n\t/workspace/controllers/hetznercluster_controller.go:516\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileNormal\n\t/workspace/controllers/hetznercluster_controller.go:221\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).Reconcile\n\t/workspace/controllers/hetznercluster_controller.go:159\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nfailed to reconcile target cluster 
manager\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileNormal\n\t/workspace/controllers/hetznercluster_controller.go:222\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).Reconcile\n\t/workspace/controllers/hetznercluster_controller.go:159\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}	


{"level":"ERROR","time":"2022-11-03T18:43:15.016Z","file":"controller/controller.go:326","message":"Reconciler error","controller":"hetznerbaremetalhost","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalHost","hetznerBareMetalHost":{"name":"bm-0","namespace":"default"},"namespace":"default","name":"bm-0","reconcileID":"e1e5e31b-e4ca-48b2-9767-ca5df59f1b1f","error":"failed to reconcile HetznerBareMetalHost default/bm-0: failed to save host status after \"registering\": failed to update status: Operation cannot be fulfilled on hetznerbaremetalhosts.infrastructure.cluster.x-k8s.io \"bm-0\": the object has been modified; please apply your changes to the latest version and try again","errorVerbose":"Operation cannot be fulfilled on hetznerbaremetalhosts.infrastructure.cluster.x-k8s.io \"bm-0\": the object has been modified; please apply your changes to the latest version and try again\nfailed to update status\ngithub.com/syself/cluster-api-provider-hetzner/pkg/services/baremetal/host.saveHost\n\t/workspace/pkg/services/baremetal/host/host.go:132\ngithub.com/syself/cluster-api-provider-hetzner/pkg/services/baremetal/host.(*Service).Reconcile\n\t/workspace/pkg/services/baremetal/host/host.go:86\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerBareMetalHostReconciler).reconcile\n\t/workspace/controllers/hetznerbaremetalhost_controller.go:146\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerBareMetalHostReconciler).Reconcile\n\t/workspace/controllers/hetznerbaremetalhost_controller.go:139\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nfailed to save host status after \"registering\"\ngithub.com/syself/cluster-api-provider-hetzner/pkg/services/baremetal/host.(*Service).Reconcile\n\t/workspace/pkg/services/baremetal/host/host.go:87\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerBareMetalHostReconciler).reconcile\n\t/workspace/controllers/hetznerbaremetalhost_controller.go:146\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerBareMetalHostReconciler).Reconcile\n\t/workspace/controllers/hetznerbaremetalhost_controller.go:139\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email 
protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nfailed to reconcile HetznerBareMetalHost default/bm-0\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerBareMetalHostReconciler).reconcile\n\t/workspace/controllers/hetznerbaremetalhost_controller.go:147\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerBareMetalHostReconciler).Reconcile\n\t/workspace/controllers/hetznerbaremetalhost_controller.go:139\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}

Another thing that puzzles me is the address shown in kubectl describe HetznerBareMetalMachine, which is x.x.x.x/26 and not x.x.x.x/32. Any idea why that is?

Status:
  Addresses:
    Address:  46.4.66.173/26

from cluster-api-provider-hetzner.

batistein avatar batistein commented on June 18, 2024

One minor thing, but important for error replication: you are using a different OS type. Please try again with the default configuration to see whether the error is gone.

In general, we have to distinguish whether the issue is an error in the controller or in the configuration. The configuration templates we provide in the releases are just a starting point and should be adjusted to your needs. Changing the configuration may introduce errors, and a Kubernetes cluster may not work as a result. Unfortunately, these sorts of errors are not within the scope of this project and must be solved by the users themselves. To support all of us on the path of configuration, I think this would be better placed in the discussion section in the first place...

Therefore I would currently focus on whether this can be caused by the machine itself, as that would be in scope. (Remember that cluster-api only handles the machine lifecycle and the necessary infrastructure, and provides only interfaces/APIs for Kubernetes configuration.)

from cluster-api-provider-hetzner.

radimk avatar radimk commented on June 18, 2024

I tried it again yesterday with installImage set to /root/images/Ubuntu-2004-focal-64-minimal-hwe.tar.gz, to no avail. I also added one HCloud machine as a worker node to see if it would be configured. The result is that the control-plane node is up and the HCloud node is up, but the bare-metal node is not. In my recent tries it only shows these conditions:

  Conditions:
    Last Transition Time:  2022-11-03T18:43:08Z
    Status:                True
    Type:                  AssociateBMHCondition
    Last Transition Time:  2022-11-03T18:43:08Z
    Status:                True
    Type:                  InstanceBootstrapReady
  Last Updated:            2022-11-03T18:44:44Z

It doesn't reach InstanceReady.

The Hetzner load balancer is set up, and it always points to the control plane only.

What I see looks like the output below. g-work is the cluster I want to create with bare metal; g-mgmt was created before from a local kind cluster and its config was moved. SSH should use a key that works for that machine (I can log in with this key from my own machine).

$ kubectl get hetznerbaremetalmachines,hcloudmachines,hetznerbaremetalhosts.infrastructure.cluster.x-k8s.io
NAME                                                                        AGE   PROVIDERID   READY   CLUSTER
hetznerbaremetalmachine.infrastructure.cluster.x-k8s.io/g-work-md-1-j7fdc   99m                        g-work

NAME                                                                       CLUSTER   IMAGE                                                 PLACEMENT GROUP   TYPE   STATE     READY   INSTANCEID          MACHINE
hcloudmachine.infrastructure.cluster.x-k8s.io/g-mgmt-control-plane-8snwk   g-mgmt    1.25.2-ubuntu-22.04-containerd-cx21-2022-11-01-1856   control-plane     cx21   running   true    hcloud://25200385   g-mgmt-control-plane-wvlbn
hcloudmachine.infrastructure.cluster.x-k8s.io/g-mgmt-md-0-kqqqz            g-mgmt    1.25.2-ubuntu-22.04-containerd-cx21-2022-11-01-1856   md-0              cx21   running   true    hcloud://25200426   g-mgmt-md-0-77c6c445fd-bfl7n
hcloudmachine.infrastructure.cluster.x-k8s.io/g-mgmt-md-0-l8c8t            g-mgmt    1.25.2-ubuntu-22.04-containerd-cx21-2022-11-01-1856   md-0              cx21   running   true    hcloud://25200427   g-mgmt-md-0-77c6c445fd-nftzg
hcloudmachine.infrastructure.cluster.x-k8s.io/g-work-control-plane-5qfvp   g-work    1.25.2-ubuntu-22.04-containerd-cx21-2022-11-01-1856   control-plane     cx21   running   true    hcloud://25261806   g-work-control-plane-plnc7
hcloudmachine.infrastructure.cluster.x-k8s.io/g-work-md-0-zph4k            g-work    1.25.2-ubuntu-22.04-containerd-cx21-2022-11-01-1856   md-0              cx21   running   true    hcloud://25261867   g-work-md-0-7d56f46cb6-5jfjw

NAME                                                        STATE          IPV4          IPV6                   THREADS   CLOCK SPEED   RAM IN GB   CONSUMER            ERRORTYPE                                 ERRORMESSAGE                AGE
hetznerbaremetalhost.infrastructure.cluster.x-k8s.io/bm-0   provisioning   46.4.66.173   2a01:4f8:140:2484::1   8         4.2           64          g-work-md-1-j7fdc   connection refused error of SSH command   ssh gave connection error   99m

from cluster-api-provider-hetzner.

batistein avatar batistein commented on June 18, 2024

This one here shows that cloud-init has a problem

2022-11-03 15:15:26,227 - schema.py[WARNING]: Invalid cloud-config provided: Please run 'sudo cloud-init schema --system' to see the schema errors.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:config' at Thu, 03 Nov 2022 15:15:26 +0000. Up 23.47 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:final' at Thu, 03 Nov 2022 15:15:26 +0000. Up 23.73 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 finished at Thu, 03 Nov 2022 15:15:26 +0000. Datasource DataSourceNone. Up 23.81 seconds
2022-11-03 15:15:26,833 - cc_final_message.py[WARNING]: Used fallback datasource

You could also double-check this by running crictl ps. I guess there are no containers coming up.
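
For example (the containerd socket path is the usual default and just an assumption here):

crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a
systemctl status kubelet
journalctl -u kubelet --no-pager | tail -n 50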

A quick Google search led me to this: https://discourse.ubuntu.com/t/release-of-cloud-init-22-2/28471/5
Maybe that helps?

from cluster-api-provider-hetzner.

radimk avatar radimk commented on June 18, 2024

TL;DR: I was able to initialize the cluster with some manual tricks. Some questions are still open for me, and I am curious what will happen when I try to add a second node. Thanks for your patience.

I looked at the Hetzner Robot UI to see what is happening with the server, and I always see several reset requests, then a long wait until the 60-minute timeout expires and CAPH tries the sequence again. Is it expected that resets are sent repeatedly?

Change server name (Server Auction #1702753) | 03/11/22 09:52 am CET | Success
Execute reset (Server Auction #1702753) | 03/11/22 09:49 am CET | Success
Execute reset (Server Auction #1702753) | 03/11/22 09:49 am CET | Success
Execute reset (Server Auction #1702753) | 03/11/22 09:49 am CET | Success
Activate rescue system (Server Auction #1702753) | 03/11/22 09:49 am CET | Success
Deactivate rescue system (Server Auction #1702753) | 03/11/22 09:49 am CET | Success
Change server name (Server Auction #1702753) | 03/11/22 09:49 am CET | Success
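
For what it's worth, the reset state can also be queried directly from the Robot webservice to see what the API reports for the server (credentials and server number are placeholders, and I'm assuming the endpoint behaves as documented):

curl -u "<robot-user>:<robot-password>" https://robot-ws.your-server.de/reset/<server-number>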

I did the last try yesterday after 22:00, and that run didn't proceed to the cloud-init setup (https://github.com/syself/cluster-api-provider-hetzner/blob/main/pkg/services/baremetal/host/host.go#L1190) because it got stuck in the previous step.
It started at 22:13 with deactivate & activate rescue, then reset. The name was changed at 22:16. After that, 'image-installing' should be done:

{"level":"INFO","time":"2022-11-04T21:16:10.990Z","file":"host/state_machine.go:67","message":"changing provisioning state","controller":"hetznerbaremetalhost","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalHost","hetznerBareMetalHost":{"name"
:"bm-0","namespace":"default"},"namespace":"default","name":"bm-0","reconcileID":"672b9c67-37ed-4b33-b4d9-369a517099ff","old":"image-installing","new":"provisioning"}

Then there is a failed attempt to SSH to that machine (it is not fully booted yet, according to its syslog), and shortly after, the reconciler fails to store the new status 'provisioning'. Is it possible there is some race condition (caused by the failed SSH attempt or something else)? Is it a problem or not?

{"level":"ERROR","time":"2022-11-04T21:16:16.592Z","file":"controller/controller.go:326","message":"Reconciler error","controller":"hetznerbaremetalhost","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalHost","hetznerBareMetalHost":{"name":"bm-0
","namespace":"default"},"namespace":"default","name":"bm-0","reconcileID":"76e83f9d-f0d5-4aa7-8a6a-487c0a16a9df","error":"failed to reconcile HetznerBareMetalHost default/bm-0: action \"image-installing\" failed: failed to perform ssh command: failed to dial ssh. Error message: dial tcp 46.4.66.173:22: i/o timeout. DialErr: failed to dial ssh","errorVerbose":"failed to dial ssh. Error message: dial tcp 46.4.66.173:22: i/o timeout. DialErr: failed to dial ssh\nfailed to perform ssh command
...
{"level":"ERROR","time":"2022-11-04T21:16:26.817Z","file":"controller/controller.go:326","message":"Reconciler error","controller":"hetznerbaremetalhost","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalHost","hetznerBareMetalHost":{"name":"bm-0","namespace":"default"},"namespace":"default","name":"bm-0","reconcileID":"fae68336-e896-449a-ac4a-2f5507de9ccb","error":"failed to reconcile HetznerBareMetalHost default/bm-0: failed to save host status after \"provisioning\": failed to update status: Operation cannot be fulfilled on hetznerbaremetalhosts.infrastructure.cluster.x-k8s.io \"bm-0\": the object has been modified; please apply your changes to the latest version and try again","errorVerbose":"Operation cannot be fulfilled on hetznerbaremetalhosts.infrastructure.cluster.x-k8s.io \"bm-0\": the object has been modified; please apply your changes to the latest version and try again\nfailed to update status\ngithub.com/syself/cluster-api-provider-hetzner/pkg/services/baremetal/host.saveHost

The machine rebooted at 22:17, and auth.log shows a dozen SSH sessions from the cluster running Cluster API/CAPH; these can be correlated with the errors about the wrong machine name. AFAICT CAPH connects, sees an unexpected machine name, and disconnects immediately.

Nov  4 22:17:24 Ubuntu-2004-focal-64-minimal-hwe sshd[680]: Server listening on 0.0.0.0 port 22.
Nov  4 22:17:24 Ubuntu-2004-focal-64-minimal-hwe sshd[680]: Server listening on :: port 22.
Nov  4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe sshd[704]: Accepted publickey for root from 167.235.203.98 port 51670 ssh2: RSA SHA256:RHeXEU88/67k8QYdUqUtNB7MTe5HoKUZKP/40QmYPlY
Nov  4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe sshd[704]: pam_unix(sshd:session): session opened for user root by (uid=0)
Nov  4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd-logind[631]: New session 1 of user root.
Nov  4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd: pam_unix(systemd-user:session): session opened for user root by (uid=0)
Nov  4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe sshd[704]: pam_unix(sshd:session): session closed for user root
Nov  4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd-logind[631]: Session 1 logged out. Waiting for processes to exit.
Nov  4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd-logind[631]: Removed session 1.
Nov  4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe sshd[806]: Accepted publickey for root from 167.235.203.98 port 51672 ssh2: RSA SHA256:RHeXEU88/67k8QYdUqUtNB7MTe5HoKUZKP/40QmYPlY
Nov  4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe sshd[806]: pam_unix(sshd:session): session opened for user root by (uid=0)
Nov  4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd-logind[631]: New session 3 of user root.
Nov  4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe sshd[806]: pam_unix(sshd:session): session closed for user root
Nov  4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd-logind[631]: Session 3 logged out. Waiting for processes to exit.
Nov  4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd-logind[631]: Removed session 3.

{"level":"ERROR","time":"2022-11-04T21:17:31.213Z","file":"controller/controller.go:326","message":"Reconciler error","controller":"hetznerbaremetalhost","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalHost","hetznerBareMetalHost":{"name":"bm-0","namespace":"default"},"namespace":"default","name":"bm-0","reconcileID":"ea9bb318-1309-4b46-85a9-905f59d0f998","error":"failed to reconcile HetznerBareMetalHost default/bm-0: action \"provisioning\" failed: failed to handle incomplete boot - installImage: unexpected hostname Ubuntu-2004-focal-64-minimal-hwe","errorVerbose":"unexpected hostname Ubuntu-2004-focal-64-minimal-hwe\nfailed to handle incomplete boot - installImage\ngithub.com/syself/cluster-api-provider-hetzner/pkg/services/baremetal/host.(*Service).actionProvisioning

BTW: the name is set during startup to a generic one:

/var/log/syslog.1:Nov  4 22:17:24 Ubuntu-2004-focal-64-minimal-hwe systemd-resolved[616]: Using system hostname 'Ubuntu-2004-focal-64-minimal-hwe'.
/var/log/syslog.1:Nov  4 22:17:24 Ubuntu-2004-focal-64-minimal-hwe kernel: [   34.082350] systemd[1]: Set hostname to <Ubuntu-2004-focal-64-minimal-hwe>.
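
A rough sketch of the manual workaround (the expected name is whatever the provisioning spec expects; the value here is only a placeholder):

hostnamectl set-hostname <expected-host-name>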

Knowing this, I made another attempt and manually patched the hostname. The host and machine are now ready, but CAPI doesn't think the machine is healthy and logs these errors:

E1105 21:40:31.824505       1 machine_controller_noderef.go:192] "Failed to parse ProviderID" err="providerID is empty" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" machine="default/g-work-md-1-5bd54bfd6c-962dt" namespace="default" name="g-work-md-1-5bd54bfd6c-962dt" reconcileID=4d8d53e8-f1df-442c-a84f-7da0b7a5c3f2 providerID="hcloud://bm-1702753" node="/bm-g-work-md-1-6fnbp"

I then re-deployed the Hetzner CCM with

KUBECONFIG=$CAPH_WORKER_CLUSTER_KUBECONFIG helm upgrade --install ccm syself/ccm-hetzner --version 1.1.4 --namespace kube-system --set privateNetwork.enabled=false

and voilà, it is up and running.

from cluster-api-provider-hetzner.

batistein avatar batistein commented on June 18, 2024

@radimk It looks like the hostname is not set up correctly via installimage. The process is the following:

  1. Boot into rescue mode.
  2. Create and execute the autosetup file. This runs installimage.
  3. Reboot and test whether we are connecting to the provisioned machine. Here we check the hostname; if it's not correct, the controller assumes there is an error.
  4. If the hostname is correct, the controller proceeds with transferring the cloud-init configuration to the node and executing it. It also tests whether cloud-init was successful.
  5. The node should then show up in Kubernetes. Now the CCM needs to initialize it. This is an independent control loop, but as the MachineHealthChecks are watching the node object, getting stuck here could also lead to a full re-creation.

If you have configured MachineHealthChecks, all of the above has to happen within the specified time frame. If the time frame is exceeded, CAPI deletes the machine object and creates a new one, so the process starts again.
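
To see which time frames apply, something like this could be used (the resource name is a placeholder; nodeStartupTimeout is a standard field of the Cluster API MachineHealthCheck spec):

kubectl get machinehealthcheck -A
kubectl get machinehealthcheck <name> -n default -o jsonpath='{.spec.nodeStartupTimeout}'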

from cluster-api-provider-hetzner.

radimk avatar radimk commented on June 18, 2024

I've been able to create a cluster with a bare metal machine, using a fresh kind cluster for bootstrapping. Later I added another bare-metal machine and it works well. I accidentally removed the second machine by changing the replica count in g-work-md-1 and added it back. IMO we can close this issue. Thanks for your work.

from cluster-api-provider-hetzner.

janiskemper avatar janiskemper commented on June 18, 2024

Thanks @radimk! I will close it now.

from cluster-api-provider-hetzner.

Privatecoder avatar Privatecoder commented on June 18, 2024

@janiskemper

I'm facing the same issues as @radimk here.

I created the same cluster dozens of times (three bm control planes (k8s 1.24.7 and 1.24.7, Ubuntu 20.04 HWE default), three bm workers (Ubuntu 20.04 HWE default), four hcloud workers (Ubuntu 22.04 default), hetzner-baremetal-control-planes-remediation flavor, v1.0.0-beta.6 and v1.0.0-beta.7), always with a fresh kind bootstrap cluster and a fresh HCloud project, but it never initializes without errors.

In addition to the route issue, I experienced the same rate-limit error for bare-metal resets through the Robot API.

The first bm control plane is always fine. For the second or third one, during cluster creation, the controller sends endless reset calls until the API locks due to the rate limit.

Like @radimk, I was also able to initialize the cluster with some manual tricks, for example by removing and re-adding the HetznerBareMetalHost object of the affected machine, SSHing into it (which should have been reset but wasn't), and triggering a manual reboot.
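
Roughly what that looked like (names and IP are placeholders):

kubectl get hetznerbaremetalhost <host-name> -n default -o yaml > host.yaml
kubectl delete hetznerbaremetalhost <host-name> -n default
kubectl apply -f host.yaml
ssh root@<host-ip> reboot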

Also, after manually fixing things like routing and the rebooting issue, the CAPH controller always logs "secret has been modified although a provisioned machine uses it","controller":"hetznerbaremetalhost" for all HetznerBareMetalHosts.

The cluster is going to be used for production, but only after I am sure it can safely be re-created without manual intervention, so I am happy to re-create it as many times as required to help debug.

from cluster-api-provider-hetzner.

janiskemper avatar janiskemper commented on June 18, 2024

What exactly do you mean by "errors"? The status of the host machine shows things like "cannot SSH into it" etc. This is not an error; it is merely used to track the progress.

The reset calls are sent when the controller thinks they are required. Could you maybe give us the exact status of the host object when it sends these requests, in addition to the logs that come from that machine? Otherwise, it is hard to say.
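
For example (the host name is a placeholder):

kubectl get hetznerbaremetalhost <host-name> -n default -o yaml
# and, on the machine itself, around the time of the resets
journalctl -b --no-pager | tail -n 200
tail -n 50 /var/log/auth.log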

And regarding the secret: if you use a secret to provision a machine and then change it, this obviously isn't possible. The only way to recover from that is to reprovision all machines that use that secret. This is just a warning that you shouldn't do this!

from cluster-api-provider-hetzner.

Privatecoder avatar Privatecoder commented on June 18, 2024

The first "error" I found was that the second controlplane could not join the cluster as cloud-init had timeout-errors. This was due to the wrong route-settings.

The next "error" was the infinite call of the robot-api until it locked due to the rate-limit (3600, so a lot of bm-reboot calls have been made – I did not investigate further / deeper – whenever this happened I killed the kind-bootstrap cluster to prevent further calls and therefore the Hetzner email-flood).

Another "error" was the mentioned "secret has been modified although a provisioned machine uses it","controller":"hetznerbaremetalhost" of the CAPH-controller.

from cluster-api-provider-hetzner.

janiskemper avatar janiskemper commented on June 18, 2024

The first error we'll investigate in the other issue (@batistein, as I'm not a networking expert).

The second error would be interesting to investigate, but for that I need, as mentioned above, the logs of the machine and the relevant statuses of the host during provisioning.

The last error should only happen if the secret has been modified, and that should be expected.

from cluster-api-provider-hetzner.

Privatecoder avatar Privatecoder commented on June 18, 2024

Understood, but regarding the last error: I created the secret and patched it to be moved from the bootstrap to the target cluster as per the tutorial. In no case was it modified afterwards, which is why I posted it here.

from cluster-api-provider-hetzner.

Privatecoder avatar Privatecoder commented on June 18, 2024

"The second error would be interesting to investigate upon, but there I need, as mentioned above, the logs of the machine and the relevant statuses of the host during provisioning."
=> I will re-create the cluster later today and post the logs of kubectl get HetznerBareMetalHost during provisioning.

What other logs are of interest?

from cluster-api-provider-hetzner.

Privatecoder avatar Privatecoder commented on June 18, 2024

@janiskemper

I was not able to reproduce the request-sending loop yesterday, as I watched the progress closely and fixed the routing issue just in time on each machine during the very first stage of cloud-init.

However, I checked the logs on Hetzner again for the 17th (as an example) and attach them here as CSV.
You can clearly see that the controller got stuck in an API-request loop multiple times (four examples in the CSV), which should never happen, regardless of any other issue.

I was also able to reliably reproduce the routing issue and the error state for the HetznerBareMetalHosts.

For my tests yesterday I set the bm-worker replicas to 0. When the loop issues arose, I always had all bm nodes included in the generated cluster.yaml (6 in total).

Later today or tomorrow I will try again without my manual routing fixes and with bm-worker replicas set to 3 as before, to see if I can reproduce the issue and fetch the logs.

One thing though: when the reset commands were issued, I was always able to SSH into the machine that was supposedly reset, which clearly did NOT reset. So this might as well be an issue with Hetzner's API not issuing the reset for whatever reason and telling the controller that the reset was unsuccessful.
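
A quick way to verify this (the IP is a placeholder): if the uptime is still large right after a "successful" reset, the machine never actually rebooted:

ssh root@<host-ip> uptime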

api-loops.csv

from cluster-api-provider-hetzner.

Privatecoder avatar Privatecoder commented on June 18, 2024

I moved the "secret has been modified" issue to a separate issue.

from cluster-api-provider-hetzner.

janiskemper avatar janiskemper commented on June 18, 2024

Sorry, I'm not 100% sure which issue you mean that you wanted to reproduce yesterday... Thanks for the logs from Hetzner, but I would be more interested in the logs of the CAPH Kubernetes operator and the corresponding host there, especially when the operator apparently fires a lot of API requests...

from cluster-api-provider-hetzner.

janiskemper avatar janiskemper commented on June 18, 2024

So if you can reproduce that reliably, I would ask you to open another issue (as this one had a different original purpose AFAIK) and attach the logs of the operator and the corresponding host objects if something unexpected happens while provisioning!

from cluster-api-provider-hetzner.
