Comments (20)

n1r1 commented on September 24, 2024

I'm wondering if the ProviderSpec can be invalid at some point of time but become valid after a while.
For example, in the baremetal case, maybe the image URL was not reachable at the time of machine creation, but later it would be reachable.

In this case I would have said that the actuator should not return an invalid configuration error if you think this is a non-terminal error and that instead, you should requeue after some time period. Is this something you'd expect to happen often?

The CAPBM actuator currently returns an invalid configuration error only if fields are missing, so it matches your description.

from machine-api-operator.

JoelSpeed commented on September 24, 2024

@mshitrit This is relevant to the conversation we had last week

zaneb commented on September 24, 2024

Looks like this will be addressed by #814 (openshift/enhancements#673).

zaneb commented on September 24, 2024

/cc @n1r1 @beekhof

JoelSpeed commented on September 24, 2024

We discussed an issue very similar to this yesterday during backlog grooming.

We discussed several ideas during that call:

  1. Introduce some sort of backoff mechanism
    The MHC controller would keep track of when it had last remediated a machine from a particular MHC and try to detect hot loops and back off. This is technically complex and would be tough to implement.

  2. Introduce a minimumAge (name TBC) field to the MHC
    The MHC would wait until the Machine was at least minimumAge old before deleting it. So if a Machine went into the Failed phase immediately, there would be at least some period in which you could catch this.

  3. Don't remediate a Failed machine unless it has a nodeRef
    If a Machine failed before it got a nodeRef, this means the configuration was invalid for some reason, and deleting the Machine is unlikely to resolve the issue. The one case where this falls down that I'm aware of is spot instances. We may want to detect spot instance creation failure and requeue rather than going Failed as we presently do.

What do people think of these potential options?

Danil-Grigorev commented on September 24, 2024

MachineHealthChecks evaluate Machine health based on the Node conditions; the Machine resource here is just a bridge to the unhealthy Node resource. Option 3 seems the most sensible, as we should not remediate Machines without a valid nodeRef, and we should rely on the NodeStartupTimeout from the MHC for expected Node creation before removing any Machine resource, including but not limited to Failed ones.

michaelgugino commented on September 24, 2024

  3. The one case where this falls down that I'm aware of is spot instances. We may want to detect spot instance creation failure and requeue rather than going Failed as we presently do.

I want spot instances to not go failed. They should spin until the price/capacity is available.

zaneb commented on September 24, 2024

My plan for the baremetal provider (implemented in openshift/cluster-api-provider-baremetal#113) when there is no capacity available is to:

  • Retry after a delay in case capacity becomes available
  • Set an ErrorMessage so that if the MachineSet is scaled down then the Machine with no provider goes first
  • Do not go to the Failed phase
  • Obviously clear the ErrorMessage once provisioning starts

I'm not familiar with how spot instances are handled, but it seems like something similar could work.

I like the idea of looking for a nodeRef to decide whether to remediate immediately, and I agree it would be a good idea to still delete the Failed Machine after NodeStartupTimeout, by reusing the existing code that bounds how long a Machine has to get to the next state when there is no Node.

JoelSpeed commented on September 24, 2024
  • Retry after a delay in case capacity becomes available
  • Set an ErrorMessage so that if the MachineSet is scaled down then the Machine with no provider goes first
  • Do not go to the Failed phase
  • Obviously clear the ErrorMessage once provisioning starts

I think this could work with spot, though technically the ErrorMessage field is meant for terminal errors only. I'm not sure the Machine controller (presently) will set this field unless the machine has gone Failed. We should check this, but it may also be possible to change that behaviour.

good idea to still delete the Failed Machine after NodeStartupTimeout

Are we talking here about Machines that go failed because of bad config? Would that not just cause a new machine to be created with an equally bad config?

zaneb commented on September 24, 2024
  • Retry after a delay in case capacity becomes available
  • Set an ErrorMessage so that if the MachineSet is scaled down then the Machine with no provider goes first
  • Do not go to the Failed phase
  • Obviously clear the ErrorMessage once provisioning starts

I think this could work with spot, though technically the ErrorMessage field is meant for terminal errors only. I'm not sure the Machine controller (presently) will set this field unless the machine has gone Failed. We should check this, but it may also be possible to change that behaviour.

It doesn't set it at present, but you can set it directly in the actuator (in fact, in the upstream Machine API it's always the actuator that sets it, and a separate controller looks for error messages and changes the Phase). It's a slight hack, but it surfaces useful information to the user (the Machine is Provisioning but there's an error because capacity is not available) and gets the MachineSet to do the right thing (prioritise it for deletion on scale-down), while allowing the actuator to retry on its own schedule, even if there is no MachineHealthCheck.

good idea to still delete the Failed Machine after NodeStartupTimeout

Are we talking here about Machines that go failed because of bad config?

Yes, sorry, I switched contexts without making that clear.

Would that not just cause a new machine to be created with an equally bad config?

Indeed, unless the config has changed in the meantime, but at least it would be slow enough that you could see what was going on. I don't think there's another mechanism for making sure that, once the config is fixed, any Failed Machines with invalid configuration errors are removed so that the MachineSet can reach its intended capacity?

n1r1 commented on September 24, 2024

I'm wondering if the ProviderSpec can be invalid at some point of time but become valid after a while.
For example, in the baremetal case, maybe the image URL was not reachable at the time of machine creation, but later it would be reachable.

NodeStartupTimeout could be too long for such scenarios on the one hand; on the other, hot-looping create-delete of machines is very confusing for the end user.

So maybe we can modify the machine controller to reconcile Failed machines and call Create() until the machine has been successfully created. If we don't delete the machine, I believe exponential backoff can be implemented more easily.

JoelSpeed commented on September 24, 2024

I'm wondering if the ProviderSpec can be invalid at some point of time but become valid after a while.
For example, in the baremetal case, maybe the image URL was not reachable at the time of machine creation, but later it would be reachable.

In this case I would have said that the actuator should not return an invalid configuration error if you think this is a non-terminal error and that instead, you should requeue after some time period. Is this something you'd expect to happen often?

Indeed, unless that config has changed in the meantime, but at least it would be slow enough that you could see what was going on. I don't think there's another mechanism of making sure that when the config is fixed, any Failed Machines with invalid configuration errors are removed so that the MachineSet can reach its intended capacity?

There isn't, no. Well, nothing built into MachineSets at least. If the autoscaler is involved, it should detect that the MachineSet didn't bring up the new nodes and scale it back down, which should remove the Failed machines.

openshift-bot commented on September 24, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

zaneb commented on September 24, 2024

/remove-lifecycle stale

openshift-bot commented on September 24, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

zaneb commented on September 24, 2024

/remove-lifecycle stale

openshift-bot commented on September 24, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented on September 24, 2024

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot commented on September 24, 2024

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci commented on September 24, 2024

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
