Comments (24)
I'm not sure what happens when you move the cluster. @batistein should answer that.
from cluster-api-provider-hetzner.
Okay, please watch the host object, especially at the moment when it gets stuck in provisioning and sends all the reset requests.
The CAPH controller logs that concern this specific host object would also be interesting :)
@Privatecoder: thanks. I think my problem is different. At least I don't see any 'failed to set ssh key' errors. The machine is provisioned for a while, but it is unhealthy and doesn't start the process of joining the cluster. The fact that the call from CAPH -> load balancer -> control plane shows an error is suspicious, but maybe it is not the root cause and I need to look somewhere else.
After the one-hour timeout expires, the machine reboots and bootstraps again under a new name.
@radimk can you check whether the Hetzner load balancer has the correct targets (including the bare metal servers)?
And then, can you please check the status of cloud-init on the server with cloud-init status
and see whether there is something interesting in /var/log/cloud-init-output.log?
@janiskemper the load balancer points only to the control-plane node. cloud-init status
says status: done
and the log file looks like this:
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Thu, 03 Nov 2022 15:13:57 +0000. Up 5.22 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init' at Thu, 03 Nov 2022 15:14:14 +0000. Up 22.65 seconds.
ci-info: +++++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++++
ci-info: +-----------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
ci-info: +-----------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: | enp0s31f6 | True | 46.4.66.173 | 255.255.255.255 | global | 90:1b:0e:cc:ce:73 |
ci-info: | enp0s31f6 | True | 2a01:4f8:140:2484::2/64 | . | global | 90:1b:0e:cc:ce:73 |
ci-info: | enp0s31f6 | True | fe80::921b:eff:fecc:ce73/64 | . | link | 90:1b:0e:cc:ce:73 |
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
ci-info: | lo | True | ::1/128 | . | host | . |
ci-info: +-----------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: ++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++
ci-info: +-------+-------------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
ci-info: +-------+-------------+-------------+---------+-----------+-------+
ci-info: | 0 | 0.0.0.0 | 46.4.66.129 | 0.0.0.0 | enp0s31f6 | UG |
ci-info: +-------+-------------+-------------+---------+-----------+-------+
ci-info: ++++++++++++++++++++++++Route IPv6 info+++++++++++++++++++++++++
ci-info: +-------+------------------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+------------------------+---------+-----------+-------+
ci-info: | 1 | 2a01:4f8:140:2484::/64 | :: | enp0s31f6 | U |
ci-info: | 2 | fe80::/64 | :: | enp0s31f6 | U |
ci-info: | 3 | ::/0 | fe80::1 | enp0s31f6 | UG |
ci-info: | 5 | local | :: | enp0s31f6 | U |
ci-info: | 6 | local | :: | enp0s31f6 | U |
ci-info: | 7 | multicast | :: | enp0s31f6 | U |
ci-info: +-------+------------------------+---------+-----------+-------+
2022-11-03 15:14:14,746 - schema.py[WARNING]: Invalid cloud-config provided: Please run 'sudo cloud-init schema --system' to see the schema errors.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:config' at Thu, 03 Nov 2022 15:14:15 +0000. Up 23.02 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:final' at Thu, 03 Nov 2022 15:14:15 +0000. Up 23.27 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 finished at Thu, 03 Nov 2022 15:14:15 +0000. Datasource DataSourceNone. Up 23.35 seconds
2022-11-03 15:14:15,367 - cc_final_message.py[WARNING]: Used fallback datasource
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Thu, 03 Nov 2022 15:15:08 +0000. Up 5.02 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init' at Thu, 03 Nov 2022 15:15:26 +0000. Up 23.12 seconds.
ci-info: +++++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++++
ci-info: +-----------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
ci-info: +-----------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: | enp0s31f6 | True | 46.4.66.173 | 255.255.255.255 | global | 90:1b:0e:cc:ce:73 |
ci-info: | enp0s31f6 | True | 2a01:4f8:140:2484::2/64 | . | global | 90:1b:0e:cc:ce:73 |
ci-info: | enp0s31f6 | True | fe80::921b:eff:fecc:ce73/64 | . | link | 90:1b:0e:cc:ce:73 |
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
ci-info: | lo | True | ::1/128 | . | host | . |
ci-info: +-----------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: ++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++
ci-info: +-------+-------------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
ci-info: +-------+-------------+-------------+---------+-----------+-------+
ci-info: | 0 | 0.0.0.0 | 46.4.66.129 | 0.0.0.0 | enp0s31f6 | UG |
ci-info: +-------+-------------+-------------+---------+-----------+-------+
ci-info: ++++++++++++++++++++++++Route IPv6 info+++++++++++++++++++++++++
ci-info: +-------+------------------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+------------------------+---------+-----------+-------+
ci-info: | 1 | 2a01:4f8:140:2484::/64 | :: | enp0s31f6 | U |
ci-info: | 2 | fe80::/64 | :: | enp0s31f6 | U |
ci-info: | 3 | ::/0 | fe80::1 | enp0s31f6 | UG |
ci-info: | 5 | local | :: | enp0s31f6 | U |
ci-info: | 6 | local | :: | enp0s31f6 | U |
ci-info: | 7 | multicast | :: | enp0s31f6 | U |
ci-info: +-------+------------------------+---------+-----------+-------+
2022-11-03 15:15:26,227 - schema.py[WARNING]: Invalid cloud-config provided: Please run 'sudo cloud-init schema --system' to see the schema errors.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:config' at Thu, 03 Nov 2022 15:15:26 +0000. Up 23.47 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:final' at Thu, 03 Nov 2022 15:15:26 +0000. Up 23.73 seconds.
Cloud-init v. 22.3.4-0ubuntu1~22.04.1 finished at Thu, 03 Nov 2022 15:15:26 +0000. Datasource DataSourceNone. Up 23.81 seconds
2022-11-03 15:15:26,833 - cc_final_message.py[WARNING]: Used fallback datasource
Digging more into that warning:
root@bm-g-work-md-1-kh22c ~ # cloud-init schema --system
Cloud config schema deprecations:
Error:
Cloud config schema errors: users.0: {'gecos': 'ubuntu', 'groups': {'adm', 'lxd', 'plugdev', 'sudo', 'dip', 'cdrom'}, 'lock_passwd': False, 'name': 'ubuntu', 'passwd': '...', 'shell': '/bin/bash'} is not valid under any of the given schemas
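The set-style `{'adm', 'lxd', ...}` in that error is the giveaway: cloud-init expects `users.*.groups` to be a list of group names or a comma-separated string, and a set literal is neither, so the whole user entry fails validation. A minimal sketch of that rule (the helper name is hypothetical, not cloud-init's own code):

```python
# Sketch: why the reported users.0 entry fails cloud-init's schema.
# 'groups' must be a list of names or a comma-separated string; the error
# message shows it rendered as a set literal, which is neither.

def groups_is_valid(groups) -> bool:
    """Hypothetical check mirroring the schema rule for users.*.groups."""
    if isinstance(groups, str):          # e.g. "adm,sudo,lxd"
        return True
    if isinstance(groups, list) and all(isinstance(g, str) for g in groups):
        return True                      # e.g. ["adm", "sudo", "lxd"]
    return False                         # sets, ints, etc. are rejected

bad = {'adm', 'lxd', 'plugdev', 'sudo', 'dip', 'cdrom'}   # what the error shows
good_list = ['adm', 'lxd', 'plugdev', 'sudo', 'dip', 'cdrom']
good_str = "adm,lxd,plugdev,sudo,dip,cdrom"

print(groups_is_valid(bad), groups_is_valid(good_list), groups_is_valid(good_str))
# → False True True
```

So rendering the groups as a YAML list (or a single comma-separated string) in the generated cloud-config should make this particular warning disappear.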
There are a few errors in the CAPH logs. The first is an error when accessing API group resources: the call goes to the load balancer, which always points to the HCloud control-plane node (not an HCloud worker node if I add one, and not the bare metal server). The second type shows failed to save host status after \"registering\": failed to update status: Operation cannot be fulfilled on hetznerbaremetalhosts.infrastructure.cluster.x-k8s.io \"bm-0\"
or the same after "provisioning". Later there is also "error":"failed to reconcile HetznerBareMetalHost default/bm-0: action \"provisioning\" failed: failed to handle incomplete boot - installImage: unexpected hostname bm-g-work-md-1-pmkq7","errorVerbose":"unexpected hostname bm-g-work-md-1-pmkq7\nfailed to handle incomplete boot - installImage
{"level":"ERROR","time":"2022-11-03T18:39:56.237Z","file":"[email protected]/logr.go:279","message":"Failed to get API Group-Resources","error":"Get \"https://49.12.21.105:443/api?timeout=32s\": EOF","stacktrace":"github.com/go-logr/logr.Logger.Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/logr.go:279\nsigs.k8s.io/controller-runtime/pkg/cluster.New\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/cluster/cluster.go:160\nsigs.k8s.io/controller-runtime/pkg/manager.New\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/manager.go:322\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).newTargetClusterManager\n\t/workspace/controllers/hetznercluster_controller.go:582\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileTargetClusterManager\n\t/workspace/controllers/hetznercluster_controller.go:514\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileNormal\n\t/workspace/controllers/hetznercluster_controller.go:221\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).Reconcile\n\t/workspace/controllers/hetznercluster_controller.go:159\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"ERROR","time":"2022-11-03T18:39:56.238Z","file":"controller/controller.go:326","message":"Reconciler error","controller":"hetznercluster","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerCluster","hetznerCluster":{"name":"g-work","namespace":"default"},"namespace":"default","name":"g-work","reconcileID":"5f143185-9dd0-42a1-8be3-2dd0b8fc4c8c","error":"failed to reconcile target cluster manager: failed to create a clusterManager for HetznerCluster default/g-work: failed to setup guest cluster manager: Get \"https://49.12.21.105:443/api?timeout=32s\": EOF","errorVerbose":"Get \"https://49.12.21.105:443/api?timeout=32s\": EOF\nfailed to setup guest cluster manager\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).newTargetClusterManager\n\t/workspace/controllers/hetznercluster_controller.go:591\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileTargetClusterManager\n\t/workspace/controllers/hetznercluster_controller.go:514\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileNormal\n\t/workspace/controllers/hetznercluster_controller.go:221\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).Reconcile\n\t/workspace/controllers/hetznercluster_controller.go:159\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email 
protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nfailed to create a clusterManager for HetznerCluster default/g-work\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileTargetClusterManager\n\t/workspace/controllers/hetznercluster_controller.go:516\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileNormal\n\t/workspace/controllers/hetznercluster_controller.go:221\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).Reconcile\n\t/workspace/controllers/hetznercluster_controller.go:159\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nfailed to reconcile target cluster 
manager\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).reconcileNormal\n\t/workspace/controllers/hetznercluster_controller.go:222\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerClusterReconciler).Reconcile\n\t/workspace/controllers/hetznercluster_controller.go:159\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
{"level":"ERROR","time":"2022-11-03T18:43:15.016Z","file":"controller/controller.go:326","message":"Reconciler error","controller":"hetznerbaremetalhost","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalHost","hetznerBareMetalHost":{"name":"bm-0","namespace":"default"},"namespace":"default","name":"bm-0","reconcileID":"e1e5e31b-e4ca-48b2-9767-ca5df59f1b1f","error":"failed to reconcile HetznerBareMetalHost default/bm-0: failed to save host status after \"registering\": failed to update status: Operation cannot be fulfilled on hetznerbaremetalhosts.infrastructure.cluster.x-k8s.io \"bm-0\": the object has been modified; please apply your changes to the latest version and try again","errorVerbose":"Operation cannot be fulfilled on hetznerbaremetalhosts.infrastructure.cluster.x-k8s.io \"bm-0\": the object has been modified; please apply your changes to the latest version and try again\nfailed to update status\ngithub.com/syself/cluster-api-provider-hetzner/pkg/services/baremetal/host.saveHost\n\t/workspace/pkg/services/baremetal/host/host.go:132\ngithub.com/syself/cluster-api-provider-hetzner/pkg/services/baremetal/host.(*Service).Reconcile\n\t/workspace/pkg/services/baremetal/host/host.go:86\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerBareMetalHostReconciler).reconcile\n\t/workspace/controllers/hetznerbaremetalhost_controller.go:146\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerBareMetalHostReconciler).Reconcile\n\t/workspace/controllers/hetznerbaremetalhost_controller.go:139\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email 
protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nfailed to save host status after \"registering\"\ngithub.com/syself/cluster-api-provider-hetzner/pkg/services/baremetal/host.(*Service).Reconcile\n\t/workspace/pkg/services/baremetal/host/host.go:87\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerBareMetalHostReconciler).reconcile\n\t/workspace/controllers/hetznerbaremetalhost_controller.go:146\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerBareMetalHostReconciler).Reconcile\n\t/workspace/controllers/hetznerbaremetalhost_controller.go:139\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nfailed to reconcile HetznerBareMetalHost 
default/bm-0\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerBareMetalHostReconciler).reconcile\n\t/workspace/controllers/hetznerbaremetalhost_controller.go:147\ngithub.com/syself/cluster-api-provider-hetzner/controllers.(*HetznerBareMetalHostReconciler).Reconcile\n\t/workspace/controllers/hetznerbaremetalhost_controller.go:139\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"}
Another thing that puzzles me is the address shown in kubectl describe HetznerBareMetalMachine
: it is x.x.x.x/26 and not x.x.x.x/32. Any idea why that is?
Status:
Addresses:
Address: 46.4.66.173/26
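For context on why the prefix length matters: the cloud-init table above reports 46.4.66.173 with netmask 255.255.255.255 (a /32, a single host address, which is typical for Hetzner dedicated servers), while the status reports a /26, which implies a 64-address subnet. A small `ipaddress` sketch of the difference:

```python
import ipaddress

# /32 as cloud-init sees it: exactly one host address.
host = ipaddress.ip_interface("46.4.66.173/32")
# /26 as the HetznerBareMetalMachine status reports it: a 64-address block.
status = ipaddress.ip_interface("46.4.66.173/26")

print(host.network)                  # 46.4.66.173/32
print(status.network)                # 46.4.66.128/26
print(status.network.num_addresses)  # 64
```

Whether the /26 in the status is cosmetic or actually used for routing decisions is a question for the CAPH maintainers; the sketch only shows what each prefix implies.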
One minor thing, but important for reproducing the error: you are using a different OS type. Please try again with the default configuration to see if the error goes away.
In general we have to distinguish whether the issue is an error in the controller or in the configuration. The configuration templates we provide in the releases are just a starting point and should be adjusted to your needs. Changing the configuration may introduce errors, and the Kubernetes cluster may not work as a result. Unfortunately, this sort of error is not within the scope of this project and must be solved by the users themselves. To support all of us on the path of configuration, I think this would be better placed in the discussion section in the first place.
Therefore I would currently focus on whether this can be caused by the machine itself, as that would be in scope. (Remember that cluster-api only handles the machine lifecycle and the necessary infrastructure, and provides only interfaces/APIs for Kubernetes configuration.)
I tried it again yesterday with installImage set to /root/images/Ubuntu-2004-focal-64-minimal-hwe.tar.gz
, to no avail. I also added one HCloud machine as a worker node to see if it would be configured. The result: the control-plane node is up, the HCloud node is up, but the bare-metal one is not. In my recent tries it only shows these conditions:
Conditions:
Last Transition Time: 2022-11-03T18:43:08Z
Status: True
Type: AssociateBMHCondition
Last Transition Time: 2022-11-03T18:43:08Z
Status: True
Type: InstanceBootstrapReady
Last Updated: 2022-11-03T18:44:44Z
It doesn't reach InstanceReady.
The Hetzner load balancer is set up, and it always points to the control plane only.
What I see looks like the output below. g-work
is the one I want to create with bare metal; g-mgmt
was created before from local kind
and its config was moved over. SSH should use a key that works for that machine (I can log in with this key from my machine).
$ kubectl get hetznerbaremetalmachines,hcloudmachines,hetznerbaremetalhosts.infrastructure.cluster.x-k8s.io
NAME AGE PROVIDERID READY CLUSTER
hetznerbaremetalmachine.infrastructure.cluster.x-k8s.io/g-work-md-1-j7fdc 99m g-work
NAME CLUSTER IMAGE PLACEMENT GROUP TYPE STATE READY INSTANCEID MACHINE
hcloudmachine.infrastructure.cluster.x-k8s.io/g-mgmt-control-plane-8snwk g-mgmt 1.25.2-ubuntu-22.04-containerd-cx21-2022-11-01-1856 control-plane cx21 running true hcloud://25200385 g-mgmt-control-plane-wvlbn
hcloudmachine.infrastructure.cluster.x-k8s.io/g-mgmt-md-0-kqqqz g-mgmt 1.25.2-ubuntu-22.04-containerd-cx21-2022-11-01-1856 md-0 cx21 running true hcloud://25200426 g-mgmt-md-0-77c6c445fd-bfl7n
hcloudmachine.infrastructure.cluster.x-k8s.io/g-mgmt-md-0-l8c8t g-mgmt 1.25.2-ubuntu-22.04-containerd-cx21-2022-11-01-1856 md-0 cx21 running true hcloud://25200427 g-mgmt-md-0-77c6c445fd-nftzg
hcloudmachine.infrastructure.cluster.x-k8s.io/g-work-control-plane-5qfvp g-work 1.25.2-ubuntu-22.04-containerd-cx21-2022-11-01-1856 control-plane cx21 running true hcloud://25261806 g-work-control-plane-plnc7
hcloudmachine.infrastructure.cluster.x-k8s.io/g-work-md-0-zph4k g-work 1.25.2-ubuntu-22.04-containerd-cx21-2022-11-01-1856 md-0 cx21 running true hcloud://25261867 g-work-md-0-7d56f46cb6-5jfjw
NAME STATE IPV4 IPV6 THREADS CLOCK SPEED RAM IN GB CONSUMER ERRORTYPE ERRORMESSAGE AGE
hetznerbaremetalhost.infrastructure.cluster.x-k8s.io/bm-0 provisioning 46.4.66.173 2a01:4f8:140:2484::1 8 4.2 64 g-work-md-1-j7fdc connection refused error of SSH command ssh gave connection error 99m
This one here shows that cloud-init has a problem:
2022-11-03 15:15:26,227 - schema.py[WARNING]: Invalid cloud-config provided: Please run 'sudo cloud-init schema --system' to see the schema errors. Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:config' at Thu, 03 Nov 2022 15:15:26 +0000. Up 23.47 seconds. Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'modules:final' at Thu, 03 Nov 2022 15:15:26 +0000. Up 23.73 seconds. Cloud-init v. 22.3.4-0ubuntu1~22.04.1 finished at Thu, 03 Nov 2022 15:15:26 +0000. Datasource DataSourceNone. Up 23.81 seconds 2022-11-03 15:15:26,833 - cc_final_message.py[WARNING]: Used fallback datasource
You could also double-check this by trying crictl ps
; I guess there are no containers coming up.
A quick Google search led me to this: https://discourse.ubuntu.com/t/release-of-cloud-init-22-2/28471/5
Maybe that helps?
TL;DR: I was able to initialize the cluster with some manual tricks. Some questions are still open for me, and I am curious what will happen when I try to add a second node. Thanks for your patience.
I looked at the Hetzner Robot UI to see what is happening with the server, and I always see several reset requests and then a long wait until the 60m timeout expires and CAPH tries the sequence again. Is it expected that resets are sent repeatedly?
Change server name (Server Auction #1702753) | 03/11/22 09:52 am CET | Success
Execute reset (Server Auction #1702753) | 03/11/22 09:49 am CET | Success
Execute reset (Server Auction #1702753) | 03/11/22 09:49 am CET | Success
Execute reset (Server Auction #1702753) | 03/11/22 09:49 am CET | Success
Activate rescue system (Server Auction #1702753) | 03/11/22 09:49 am CET | Success
Deactivate rescue system (Server Auction #1702753) | 03/11/22 09:49 am CET | Success
Change server name (Server Auction #1702753) | 03/11/22 09:49 am CET | Success
I did the last try yesterday after 22:00, and that run didn't proceed to the cloud-init setup https://github.com/syself/cluster-api-provider-hetzner/blob/main/pkg/services/baremetal/host/host.go#L1190 because it got stuck in the previous step.
Start at 22:13 with deactivate & activate rescue, then reset. Name change at 22:16. At that point 'image-installing' should be done:
{"level":"INFO","time":"2022-11-04T21:16:10.990Z","file":"host/state_machine.go:67","message":"changing provisioning state","controller":"hetznerbaremetalhost","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalHost","hetznerBareMetalHost":{"name"
:"bm-0","namespace":"default"},"namespace":"default","name":"bm-0","reconcileID":"672b9c67-37ed-4b33-b4d9-369a517099ff","old":"image-installing","new":"provisioning"}
A failed attempt to SSH to that machine follows (it is not fully booted yet, according to its syslog), and shortly after, the reconciler fails to store the new status 'provisioning'. Is it possible that there is some race condition (due to the failed SSH attempt or something else)? Is it a problem or not?
{"level":"ERROR","time":"2022-11-04T21:16:16.592Z","file":"controller/controller.go:326","message":"Reconciler error","controller":"hetznerbaremetalhost","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalHost","hetznerBareMetalHost":{"name":"bm-0
","namespace":"default"},"namespace":"default","name":"bm-0","reconcileID":"76e83f9d-f0d5-4aa7-8a6a-487c0a16a9df","error":"failed to reconcile HetznerBareMetalHost default/bm-0: action \"image-installing\" failed: failed to perform ssh command: failed to dial ssh. Error message: dial tcp 46.4.66.173:22: i/o timeout. DialErr: failed to dial ssh","errorVerbose":"failed to dial ssh. Error message: dial tcp 46.4.66.173:22: i/o timeout. DialErr: failed to dial ssh\nfailed to perform ssh command
...
{"level":"ERROR","time":"2022-11-04T21:16:26.817Z","file":"controller/controller.go:326","message":"Reconciler error","controller":"hetznerbaremetalhost","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalHost","hetznerBareMetalHost":{"name":"bm-0","namespace":"default"},"namespace":"default","name":"bm-0","reconcileID":"fae68336-e896-449a-ac4a-2f5507de9ccb","error":"failed to reconcile HetznerBareMetalHost default/bm-0: failed to save host status after \"provisioning\": failed to update status: Operation cannot be fulfilled on hetznerbaremetalhosts.infrastructure.cluster.x-k8s.io \"bm-0\": the object has been modified; please apply your changes to the latest version and try again","errorVerbose":"Operation cannot be fulfilled on hetznerbaremetalhosts.infrastructure.cluster.x-k8s.io \"bm-0\": the object has been modified; please apply your changes to the latest version and try again\nfailed to update status\ngithub.com/syself/cluster-api-provider-hetzner/pkg/services/baremetal/host.saveHost
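On the "Operation cannot be fulfilled ... the object has been modified" part: that is Kubernetes optimistic concurrency. Two reconciles raced on the same status update; the loser is expected to re-read the object and retry, so on its own this error is usually benign noise rather than the root cause. A minimal sketch of the usual retry-on-conflict pattern (the `read_latest`/`try_update` stand-ins are hypothetical, not CAPH's actual code):

```python
class ConflictError(Exception):
    """Stand-in for a 409 'object has been modified' API response."""

def update_with_retry(read_latest, try_update, max_attempts=5):
    """Re-read the object and retry whenever the server reports a conflict."""
    for _ in range(max_attempts):
        obj = read_latest()              # fetch the latest resourceVersion
        try:
            return try_update(obj)
        except ConflictError:
            continue                     # someone else won the race; try again
    raise RuntimeError("gave up after repeated conflicts")

# Simulate one lost race followed by success.
attempts = []
def try_update(obj):
    attempts.append(obj)
    if len(attempts) == 1:
        raise ConflictError()
    return "updated"

print(update_with_retry(lambda: {"fresh": True}, try_update))
# → updated
```

If the controller already retries like this (which controller-runtime reconcilers typically do via requeueing), the logged conflict should resolve itself on the next pass.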
The machine rebooted at 22:17, and auth.log
shows a dozen SSH sessions from the cluster running Cluster API/CAPH; those can be correlated with the errors about the wrong machine name. AFAICT CAPH connects, sees an unexpected machine name, and disconnects immediately.
Nov 4 22:17:24 Ubuntu-2004-focal-64-minimal-hwe sshd[680]: Server listening on 0.0.0.0 port 22.
Nov 4 22:17:24 Ubuntu-2004-focal-64-minimal-hwe sshd[680]: Server listening on :: port 22.
Nov 4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe sshd[704]: Accepted publickey for root from 167.235.203.98 port 51670 ssh2: RSA SHA256:RHeXEU88/67k8QYdUqUtNB7MTe5HoKUZKP/40QmYPlY
Nov 4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe sshd[704]: pam_unix(sshd:session): session opened for user root by (uid=0)
Nov 4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd-logind[631]: New session 1 of user root.
Nov 4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd: pam_unix(systemd-user:session): session opened for user root by (uid=0)
Nov 4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe sshd[704]: pam_unix(sshd:session): session closed for user root
Nov 4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd-logind[631]: Session 1 logged out. Waiting for processes to exit.
Nov 4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd-logind[631]: Removed session 1.
Nov 4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe sshd[806]: Accepted publickey for root from 167.235.203.98 port 51672 ssh2: RSA SHA256:RHeXEU88/67k8QYdUqUtNB7MTe5HoKUZKP/40QmYPlY
Nov 4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe sshd[806]: pam_unix(sshd:session): session opened for user root by (uid=0)
Nov 4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd-logind[631]: New session 3 of user root.
Nov 4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe sshd[806]: pam_unix(sshd:session): session closed for user root
Nov 4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd-logind[631]: Session 3 logged out. Waiting for processes to exit.
Nov 4 22:17:31 Ubuntu-2004-focal-64-minimal-hwe systemd-logind[631]: Removed session 3.
{"level":"ERROR","time":"2022-11-04T21:17:31.213Z","file":"controller/controller.go:326","message":"Reconciler error","controller":"hetznerbaremetalhost","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"HetznerBareMetalHost","hetznerBareMetalHost":{"name":"bm-0","namespace":"default"},"namespace":"default","name":"bm-0","reconcileID":"ea9bb318-1309-4b46-85a9-905f59d0f998","error":"failed to reconcile HetznerBareMetalHost default/bm-0: action \"provisioning\" failed: failed to handle incomplete boot - installImage: unexpected hostname Ubuntu-2004-focal-64-minimal-hwe","errorVerbose":"unexpected hostname Ubuntu-2004-focal-64-minimal-hwe\nfailed to handle incomplete boot - installImage\ngithub.com/syself/cluster-api-provider-hetzner/pkg/services/baremetal/host.(*Service).actionProvisioning
BTW: the name is set during startup to a generic one:
/var/log/syslog.1:Nov 4 22:17:24 Ubuntu-2004-focal-64-minimal-hwe systemd-resolved[616]: Using system hostname 'Ubuntu-2004-focal-64-minimal-hwe'.
/var/log/syslog.1:Nov 4 22:17:24 Ubuntu-2004-focal-64-minimal-hwe kernel: [ 34.082350] systemd[1]: Set hostname to <Ubuntu-2004-focal-64-minimal-hwe>.
Knowing this, I made another attempt and manually patched the hostname. The host and machine are ready, but CAPI doesn't consider the node healthy and logs these errors:
E1105 21:40:31.824505 1 machine_controller_noderef.go:192] "Failed to parse ProviderID" err="providerID is empty" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" machine="default/g-work-md-1-5bd54bfd6c-962dt" namespace="default" name="g-work-md-1-5bd54bfd6c-962dt" reconcileID=4d8d53e8-f1df-442c-a84f-7da0b7a5c3f2 providerID="hcloud://bm-1702753" node="/bm-g-work-md-1-6fnbp"
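For context, the providerID format in the log above is easy to pick apart. A small sketch (the providerID value is copied from the error message; the parsing itself is purely illustrative, not the controller's code):

```shell
# Sketch: splitting the providerID from the log above into its scheme and
# the bare metal server name/number. Illustrative only.
provider_id="hcloud://bm-1702753"
scheme="${provider_id%%://*}"                 # strip everything from "://" on
server_number="${provider_id#hcloud://bm-}"   # strip the "hcloud://bm-" prefix
echo "scheme=$scheme server=$server_number"
```

The "providerID is empty" part of the error means the node object had no providerID set yet, which is what the CCM is responsible for filling in.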
I re-deployed the Hetzner CCM again with KUBECONFIG=$CAPH_WORKER_CLUSTER_KUBECONFIG helm upgrade --install ccm syself/ccm-hetzner --version 1.1.4 --namespace kube-system --set privateNetwork.enabled=false
and voilà, it is up and running.
from cluster-api-provider-hetzner.
@radimk It looks like the hostname is not set correctly via installimage. The process is the following:
- booting into rescue mode
- creating and executing the autosetup file, which runs installimage
- rebooting and testing whether we can connect to the provisioned machine. Here we check the hostname; if it's not correct, the controller assumes there is an error.
- If the hostname is correct, the controller transfers the cloud-init configuration to the node and executes it. It also tests whether cloud-init was successful.
- The node should then show up in Kubernetes. Now the CCM needs to initialize it. This is an independent control loop, but as the MachineHealthChecks are watching the node object, being stuck here could also lead to a full recreation.
If you have configured MachineHealthChecks, all of the above has to happen within the specified time frame. If the time frame is exceeded, CAPI deletes the machine object and creates a new one, so the process starts again.
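The hostname check in the reboot step above can be pictured with a minimal shell sketch. The expected name and the SSH step are assumptions for illustration; the real check lives in the Go controller:

```shell
# Illustrative sketch of the post-installimage check: compare the hostname the
# machine reports with the expected machine name. A leftover image hostname
# like the one in the logs above counts as an incomplete boot.
expected="bm-0"
actual="Ubuntu-2004-focal-64-minimal-hwe"   # in reality: actual=$(ssh root@"$ip" hostname)
if [ "$actual" != "$expected" ]; then
  result="unexpected hostname $actual"
else
  result="hostname ok"
fi
echo "$result"
```

This matches the "unexpected hostname Ubuntu-2004-focal-64-minimal-hwe" error in the reconciler log above: the machine still carried the generic image hostname after installimage.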
I've been able to create a cluster with a bare metal machine using a fresh kind cluster for its bootstrapping. Later I added another bare metal machine and it works well. I accidentally removed the 2nd machine by changing the replica count in g-work-md-1 and added it back. IMO we can close this issue. Thanks for your work.
Thanks @radimk! I will close it now.
I'm facing the same issues as @radimk here.
I created the same cluster dozens of times (three bm-controlplanes (k8s 1.24.7 and 1.24.7, Ubuntu 20.04 HWE default), three bm-workers (Ubuntu 20.04 HWE default), four hcloud-workers (Ubuntu 22.04 default), hetzner-baremetal-control-planes-remediation flavor, v1.0.0-beta.6 and v1.0.0-beta.7), always with a fresh kind bootstrap cluster and a fresh HCloud project, but it never initializes without errors.
In addition to the route issue, I experienced the same rate-limit error for bm resets through the Robot API.
The first bm-controlplane is always fine. For the second or third one – during cluster creation – the controller sends endless reset calls until the API locks due to the rate limit.
As @radimk, I was also able to initialize the cluster with some manual tricks, for example by removing and re-adding the HetznerBareMetalHost object of the affected machine, SSHing into it (which should have been reset but wasn't) and triggering a manual reboot.
Also – after manually fixing things like routing and the reboot issue – the CAPH controller always logs "secret has been modified although a provisioned machine uses it","controller":"hetznerbaremetalhost" for all HetznerBareMetalHosts.
The cluster is going to be used for production, but only after I am sure it can safely be re-created without manual intervention, so I am happy to re-create it as many times as required to help debug.
What exactly do you mean by "errors"? The status of the host machine shows things like "cannot SSH into it" etc. This is not an error, but is merely used to track the progress.
The reset calls are sent if the controller thinks it is required. Could you maybe give us the exact status of the host object when it sends these requests in addition to the logs that come from that machine? Otherwise, it is hard to say.
And this thing with the secret: If you use a secret to provision a machine and then change it, obviously this isn't possible. The only option to recover from that is reprovisioning all machines that use that secret. This is just a warning for you that you shouldn't do this!
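One way to picture that warning (purely illustrative; the controller's actual bookkeeping is internal): compare a checksum of the secret data recorded at provisioning time with a checksum of the current data.

```shell
# Sketch only: if the hash of the current secret data no longer matches the
# hash recorded when the machine was provisioned, the controller can only
# warn -- recovering requires reprovisioning the machines using that secret.
# The two key values here are made-up placeholders.
recorded_hash=$(printf '%s' 'ssh-key-at-provisioning' | sha256sum | cut -d' ' -f1)
current_hash=$(printf '%s' 'ssh-key-now' | sha256sum | cut -d' ' -f1)
if [ "$recorded_hash" != "$current_hash" ]; then
  msg="secret has been modified although a provisioned machine uses it"
fi
echo "$msg"
```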
The first "error" I found was that the second controlplane could not join the cluster as cloud-init had timeout-errors. This was due to the wrong route-settings.
The next "error" was the infinite calling of the Robot API until it locked due to the rate limit (3600, so a lot of bm-reboot calls had been made – I did not investigate deeper – whenever this happened I killed the kind bootstrap cluster to prevent further calls and the resulting Hetzner email flood).
Another "error" was the mentioned "secret has been modified although a provisioned machine uses it","controller":"hetznerbaremetalhost" of the CAPH-controller.
The first error we'll investigate in the other issue (@batistein, as I'm not a networking expert).
The second error would be interesting to investigate upon, but there I need, as mentioned above, the logs of the machine and the relevant statuses of the host during provisioning.
The last error should only happen if the secret has been modified, and that should be expected.
Understood, but for the last error: I created the secret and patched it to be moved from the bootstrap to the target cluster as per the tutorial. In no case was it modified afterwards, which is why I posted it here.
"The second error would be interesting to investigate upon, but there I need, as mentioned above, the logs of the machine and the relevant statuses of the host during provisioning."
=> I will re-create the cluster later today and post the output of kubectl get HetznerBareMetalHost during provisioning.
What other logs are of interest?
I was not able to reproduce the request-sending loop yesterday, as I watched the progress closely and fixed the routing issue just in time on each machine during the very first stage of cloud-init.
However, I checked the logs on Hetzner again for the 17th (as an example) and attach them here as CSV.
You can clearly see that the controller got stuck in an API-request loop multiple times (four examples in the CSV), which should never happen, regardless of any other issue.
Also, I was able to reliably reproduce the routing issue and the error state for the HetznerBareMetalHosts.
For my tests yesterday I set bm-worker replicas to 0. When the loop-issues arose I always had all bm-nodes included in the generated cluster.yaml (6 in total).
Later today or tomorrow I will try again without my manual routing fixes and with bm-worker replicas set to 3 as before, to see if I can reproduce the issue and fetch the logs.
One thing though: when the reset commands were issued, I was always able to SSH into the reset machine, which clearly did NOT reset, so this might as well be an issue with Hetzner's API not issuing the reset for whatever reason while telling the controller that the reset was unsuccessful.
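Regardless of where the fault lies, a retry loop against a rate-limited API should back off rather than fire requests continuously. A generic exponential-backoff sketch (not the controller's actual code; the API call is stubbed out):

```shell
# Generic exponential-backoff sketch for retrying a reset-style call without
# flooding a rate-limited API. "if false" stands in for a real call that
# keeps failing; real code would sleep "$delay" between attempts.
delay=1
for attempt in 1 2 3 4 5; do
  if false; then break; fi   # stub: replace with the real API call
  echo "attempt $attempt failed, next delay ${delay}s"
  delay=$((delay * 2))
done
```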
I moved the secret-modified issue to a separate issue.
Sorry, I'm not 100% sure which issue you wanted to reproduce yesterday... Thanks for the logs from Hetzner, but I would be more interested in the logs of the CAPH Kubernetes operator and the corresponding host object there, especially when the operator apparently fires a lot of API requests...
So if you can reproduce that reliably I would ask you to open another issue (as this one had another original purpose AFAIK) and attach the logs of the operator and the corresponding host objects if something unexpected happens while provisioning!