Comments (20)
> I'm wondering if the ProviderSpec can be invalid at some point in time but become valid after a while. For example, in the baremetal case, maybe the image URL was not reachable at the time of machine creation, but later it would be reachable.

> In this case I would have said that the actuator should not return an invalid configuration error if you think this is a non-terminal error, and that instead you should requeue after some time period. Is this something you'd expect to happen often?

The CAPBM actuator currently returns an invalid configuration error only if the fields are missing, so it matches your description.
from machine-api-operator.
@mshitrit This is relevant to the conversation we had last week
Looks like this will be addressed by #814 (openshift/enhancements#673).
We discussed an issue very similar to this yesterday during backlog grooming.
We discussed several ideas during that call:

1. Introduce some sort of backoff mechanism. The MHC controller would keep track of when it had last remediated a machine from a particular MHC and try to detect hot loops and back off. This is technically complex and would be tough to implement.
2. Introduce a `minimumAge` (name TBC) field to the MHC. The MHC would wait until the Machine was at least `minimumAge` old before deleting it. So if a Machine went into the `Failed` phase immediately, there would be at least some period in which you could catch this.
3. Don't remediate a `Failed` machine unless it has a `nodeRef`. If a Machine failed before it got a `nodeRef`, this means the configuration was invalid for some reason. Deleting the Machine is unlikely to resolve this issue. The one case where this falls down that I'm aware of is spot instances. We may want to detect spot instance creation failure and requeue rather than going failed as we presently do.
What do people think of these potential options?
`MachineHealthChecks` evaluate Machine health based on the Node conditions. The Machine resource here is just a bridge to the unhealthy Node resource. Option 3 seems the most sane, as we should not remediate Machines without a valid `nodeRef`, and we should depend on the `NodeStartupTimeout` from the MHC for expected Node creation before removing any Machine resource, including but not limited to the `Failed` ones.
> 3. The one case where this falls down that I'm aware of is spot instances. We may want to detect spot instance creation failure and requeue rather than going failed as we presently do.
I want spot instances to not go failed. They should spin until the price/capacity is available.
My plan for the baremetal provider (implemented in openshift/cluster-api-provider-baremetal#113) when there is no capacity available is to:
- Retry after a delay in case capacity becomes available
- Set an `ErrorMessage` so that if the MachineSet is scaled down then the Machine with no provider goes first
- Do not go to the `Failed` phase
- Obviously clear the `ErrorMessage` once provisioning starts

I'm not familiar with how spot instances are handled, but it seems like something similar could work.

I like the idea of looking for a `nodeRef` to decide whether to immediately remediate, and I agree it would be a good idea to still delete the `Failed` Machine after `NodeStartupTimeout`, by just using the existing code for how long a Machine has to get to the next state if there is no Node.
> - Retry after a delay in case capacity becomes available
> - Set an `ErrorMessage` so that if the MachineSet is scaled down then the Machine with no provider goes first
> - Do not go to the `Failed` phase
> - Obviously clear the `ErrorMessage` once provisioning starts

I think this could work with spot; technically, though, the error message field is meant for terminal errors only. I'm not sure if the Machine Controller (presently) will set this field unless the machine has gone failed. We should check this, but it may also be able to be changed.

> good idea to still delete the `Failed` Machine after `NodeStartupTimeout`

Are we talking here about Machines that go failed because of bad config? Would that not just cause a new machine to be created with an equally bad config?
> > - Retry after a delay in case capacity becomes available
> > - Set an `ErrorMessage` so that if the MachineSet is scaled down then the Machine with no provider goes first
> > - Do not go to the `Failed` phase
> > - Obviously clear the `ErrorMessage` once provisioning starts
>
> I think this could work with spot, technically the error message field is meant for terminal errors only though. I'm not sure if the Machine Controller (presently) will set this field unless the machine has gone failed. We should check this, but it may also be able to be changed.

It doesn't set it at present, but you can set it directly in the actuator (in fact, in the upstream Machine API it's always the actuator that sets it, and a separate controller looks for error messages and changes the Phase). It's a slight hack, but it displays useful information to the user (the Machine is `Provisioning` but there's an error because capacity is not available) and gets the MachineSet to do the right thing (prioritise for deletion on scale-down) while allowing the actuator to retry on its own schedule, even if there is no MachineHealthCheck.

> > good idea to still delete the `Failed` Machine after `NodeStartupTimeout`
>
> Are we talking here about Machines that go failed because of bad config?

Yes, sorry, I switched contexts without making that clear.

> Would that not just cause a new machine to be created with an equally bad config?

Indeed, unless that config has changed in the meantime, but at least it would be slow enough that you could see what was going on. I don't think there's another mechanism of making sure that when the config is fixed, any `Failed` Machines with invalid configuration errors are removed so that the MachineSet can reach its intended capacity?
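The scale-down effect being relied on here (errored Machines being chosen first when the MachineSet shrinks) can be sketched roughly as follows. This is an illustrative sketch of the priority idea, not the actual MachineSet delete-policy code, and the `Machine` struct here is a hypothetical minimal stand-in.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical minimal Machine for illustration.
type Machine struct {
	Name         string
	ErrorMessage string
	HasNodeRef   bool
}

// deletePriority mirrors the idea (not the exact upstream code) that the
// MachineSet controller prefers to delete Machines that report an error or
// never produced a Node when scaling down.
func deletePriority(m Machine) int {
	switch {
	case m.ErrorMessage != "":
		return 0 // errored machines go first
	case !m.HasNodeRef:
		return 1 // then machines that never got a Node
	default:
		return 2 // healthy machines last
	}
}

// pickForScaleDown returns the n Machines that should be deleted first.
func pickForScaleDown(machines []Machine, n int) []Machine {
	sorted := append([]Machine(nil), machines...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return deletePriority(sorted[i]) < deletePriority(sorted[j])
	})
	return sorted[:n]
}

func main() {
	ms := []Machine{
		{Name: "healthy", HasNodeRef: true},
		{Name: "no-capacity", ErrorMessage: "no available host"},
	}
	fmt.Println(pickForScaleDown(ms, 1)[0].Name) // no-capacity
}
```

This is why setting `ErrorMessage` without going `Failed` still steers the MachineSet toward deleting the stuck Machine on scale-down.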
I'm wondering if the ProviderSpec can be invalid at some point in time but become valid after a while.
For example, in the baremetal case, maybe the image URL was not reachable at the time of machine creation, but later it would be reachable. `NodeStartupTimeout` could be too long for such scenarios on the one hand, and hot-looping create-delete machines is very confusing for the end user.
So maybe we can modify the machine controller to reconcile failed machines and call `Create()` until the machine has been successfully created. If we don't delete the machine, I believe that exponential backoff can be implemented more easily.
> I'm wondering if the ProviderSpec can be invalid at some point in time but become valid after a while. For example, in the baremetal case, maybe the image URL was not reachable at the time of machine creation, but later it would be reachable.

In this case I would have said that the actuator should not return an invalid configuration error if you think this is a non-terminal error, and that instead you should requeue after some time period. Is this something you'd expect to happen often?

> Indeed, unless that config has changed in the meantime, but at least it would be slow enough that you could see what was going on. I don't think there's another mechanism of making sure that when the config is fixed, any Failed Machines with invalid configuration errors are removed so that the MachineSet can reach its intended capacity?

There isn't, no. Well, nothing built into MachineSets at least. If the autoscaler is involved, it should detect that the MachineSet didn't bring up the new nodes and scale it back down, which should remove the `Failed` machines.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting `/remove-lifecycle stale`.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting `/lifecycle frozen`.
If this issue is safe to close now please do so with `/close`.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting `/remove-lifecycle stale`.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting `/lifecycle frozen`.
If this issue is safe to close now please do so with `/close`.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting `/remove-lifecycle stale`.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting `/lifecycle frozen`.
If this issue is safe to close now please do so with `/close`.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting `/lifecycle frozen`.
If this issue is safe to close now please do so with `/close`.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting `/reopen`.
Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
Exclude this issue from closing again by commenting `/lifecycle frozen`.
/close
@openshift-bot: Closing this issue.
In response to this:
> Rotten issues close after 30d of inactivity.
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.