Comments (20)

n1r1 commented on September 24, 2024

I'm wondering if the ProviderSpec can be invalid at some point of time but become valid after a while.
For example, in the baremetal case, maybe the image URL was not reachable at the time of machine creation, but later it would be reachable.

In this case I would have said that the actuator should not return an invalid configuration error if you think this is a non-terminal error and that instead, you should requeue after some time period. Is this something you'd expect to happen often?

The CAPBM actuator currently returns an invalid configuration error only if fields are missing, so it matches your description.

from machine-api-operator.

JoelSpeed commented on September 24, 2024

@mshitrit This is relevant to the conversation we had last week

zaneb commented on September 24, 2024

Looks like this will be addressed by #814 (openshift/enhancements#673).

zaneb commented on September 24, 2024

/cc @n1r1 @beekhof

JoelSpeed commented on September 24, 2024

We discussed an issue very similar to this yesterday during backlog grooming.

We discussed several ideas during that call:

  1. Introduce some sort of backoff mechanism
    The MHC controller would keep track of when it had last remediated a machine from a particular MHC and try to detect hot loops and back off. This is technically complex and would be tough to implement.

  2. Introduce a minimumAge (name TBC) field to the MHC
    The MHC would wait until the Machine was at least minimumAge old before deleting it. So if a Machine went into the Failed phase immediately, there would be at least some period in which you could catch this.

  3. Don't remediate a Failed machine unless it has a nodeRef
    If a Machine failed before it got a nodeRef, this means the configuration was invalid for some reason, and deleting the Machine is unlikely to resolve the issue. The one case where this falls down that I'm aware of is spot instances. We may want to detect spot instance creation failure and requeue rather than going Failed as we presently do.

What do people think of these potential options?

Danil-Grigorev commented on September 24, 2024

MachineHealthChecks evaluate Machine health based on the Node conditions; the Machine resource here is just a bridge to the unhealthy Node resource. Option 3 seems the most sensible, as we should not remediate Machines without a valid nodeRef, and we should rely on the NodeStartupTimeout from the MHC for expected Node creation before removing any Machine resource, including but not limited to Failed ones.

michaelgugino commented on September 24, 2024

  3. The one case where this falls down that I'm aware of is spot instances. We may want to detect spot instance creation failure and requeue rather than going Failed as we presently do.

I want spot instances to not go failed. They should spin until the price/capacity is available.

zaneb commented on September 24, 2024

My plan for the baremetal provider (implemented in openshift/cluster-api-provider-baremetal#113) when there is no capacity available is to:

  • Retry after a delay in case capacity becomes available
  • Set an ErrorMessage so that if the MachineSet is scaled down then the Machine with no provider goes first
  • Do not go to the Failed phase
  • Obviously clear the ErrorMessage once provisioning starts

I'm not familiar with how spot instances are handled, but it seems like something similar could work.

I like the idea of looking for a nodeRef to decide whether to remediate immediately, and I agree it would be a good idea to still delete the Failed Machine after NodeStartupTimeout, by reusing the existing code that bounds how long a Machine has to get to the next state when there is no Node.

JoelSpeed commented on September 24, 2024
  • Retry after a delay in case capacity becomes available
  • Set an ErrorMessage so that if the MachineSet is scaled down then the Machine with no provider goes first
  • Do not go to the Failed phase
  • Obviously clear the ErrorMessage once provisioning starts

I think this could work with spot, though technically the ErrorMessage field is meant for terminal errors only. I'm not sure the Machine controller (presently) will set this field unless the machine has gone Failed. We should check this, but it may also be possible to change that behaviour.

good idea to still delete the Failed Machine after NodeStartupTimeout

Are we talking here about Machines that go failed because of bad config? Would that not just cause a new machine to be created with an equally bad config?

zaneb commented on September 24, 2024
  • Retry after a delay in case capacity becomes available
  • Set an ErrorMessage so that if the MachineSet is scaled down then the Machine with no provider goes first
  • Do not go to the Failed phase
  • Obviously clear the ErrorMessage once provisioning starts

I think this could work with spot, though technically the ErrorMessage field is meant for terminal errors only. I'm not sure the Machine controller (presently) will set this field unless the machine has gone Failed. We should check this, but it may also be possible to change that behaviour.

It doesn't set it at present, but you can set it directly in the actuator (in fact, in the upstream Machine API it's always the actuator that sets it, and a separate controller looks for error messages and changes the Phase). It's a slight hack, but it surfaces useful information to the user (the Machine is Provisioning but there's an error because capacity is not available) and gets the MachineSet to do the right thing (prioritise it for deletion on scale-down), while allowing the actuator to retry on its own schedule, even if there is no MachineHealthCheck.

good idea to still delete the Failed Machine after NodeStartupTimeout

Are we talking here about Machines that go failed because of bad config?

Yes, sorry, I switched contexts without making that clear.

Would that not just cause a new machine to be created with an equally bad config?

Indeed, unless the config has changed in the meantime, but at least it would be slow enough that you could see what was going on. I don't think there's another mechanism for making sure that, once the config is fixed, any Failed Machines with invalid configuration errors are removed so that the MachineSet can reach its intended capacity?

n1r1 commented on September 24, 2024

I'm wondering if the ProviderSpec can be invalid at some point of time but become valid after a while.
For example, in the baremetal case, maybe the image URL was not reachable at the time of machine creation, but later it would be reachable.

NodeStartupTimeout could be too long for such scenarios on the one hand; on the other, hot-looping create-delete of machines is very confusing for the end user.

So maybe we can modify the machine controller to reconcile Failed machines and call Create() until the machine has been successfully created. If we don't delete the machine, I believe exponential backoff can be implemented more easily.

JoelSpeed commented on September 24, 2024

I'm wondering if the ProviderSpec can be invalid at some point of time but become valid after a while.
For example, in the baremetal case, maybe the image URL was not reachable at the time of machine creation, but later it would be reachable.

In this case I would have said that the actuator should not return an invalid configuration error if you think this is a non-terminal error and that instead, you should requeue after some time period. Is this something you'd expect to happen often?

Indeed, unless that config has changed in the meantime, but at least it would be slow enough that you could see what was going on. I don't think there's another mechanism of making sure that when the config is fixed, any Failed Machines with invalid configuration errors are removed so that the MachineSet can reach its intended capacity?

There isn't, no. Well, nothing built into MachineSets at least. If the autoscaler is involved, it should detect that the MachineSet didn't bring up the new nodes and scale it back down, which should remove the Failed machines.

openshift-bot commented on September 24, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

zaneb commented on September 24, 2024

/remove-lifecycle stale

openshift-bot commented on September 24, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

zaneb commented on September 24, 2024

/remove-lifecycle stale

openshift-bot commented on September 24, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented on September 24, 2024

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot commented on September 24, 2024

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci commented on September 24, 2024

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
