Comments (17)
I am unsure if this can be covered under transient error handling.
My understanding of this:
Transient error handling is where we attempt a network call, and the result of the call is one that if we retry it later on it may succeed.
What we have in this case is a successful call to create a pod in the cluster. The pod has been created, in that the kubernetes object is there. Then the cluster has either failed to make a running pod on a node, or has created it and then it has been evicted.
What cannot be automated is the knowledge that recreating a pod is a safe thing to do, and we defer to the user to mark a pod recreation as safe by them saying a step can be retried.
In the specific case above it appears that the pod never started running (it has transitioned from state `Pending`), but we cannot actually know this, as the controller may have missed the event that said it started running.
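To make the distinction concrete, here is a minimal sketch of what retrying a transient failure looks like: the same call is attempted again and may succeed. This is an illustration only, not the controller's code; `TransientError`, `call_with_retry` and `make_flaky_call` are hypothetical names.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a failure that may succeed if retried later."""


def call_with_retry(call, attempts=3, base_delay=0.0):
    """Retry `call` while it raises TransientError, up to `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff between attempts (0 by default for the demo).
            time.sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_call(failures):
    """Build a stand-in network call that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def call():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("simulated network failure")
        return "created"

    return call, state
```

In the eviction case the create call has already succeeded, so a loop like this has nothing left to retry. Running the step again means creating a new pod, which is the decision we defer to the user.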
@tooptoop4 did this node have retry on?
from argo-workflows.
retryStrategy:
  limit: "2"
  retryPolicy: "Always"
  expression: 'lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) in [143])'
from argo-workflows.
I don't think we should include this as built-in, which is why I am suggesting trying the env var.
from argo-workflows.
node marked as failed
from argo-workflows.
In which case the controller is doing the right thing, isn't it?
It will evaluate your expression AND `policy: Always`, but your expression is false, so it won't retry.
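As a rough model of that decision, assuming the policy and the expression are simply ANDed together: the sketch below is a Python rendering of the expression quoted in this thread, not the controller's implementation. The function names are hypothetical, `OnTransientError` is left out, and representing a missing exit code as `None` is an assumption.

```python
# Which node statuses each retryPolicy value allows a retry for (simplified model).
POLICY_STATUSES = {
    "Always": {"Failed", "Error"},
    "OnFailure": {"Failed"},
    "OnError": {"Error"},
}


def expression_matches(status, exit_code):
    """Model of: status == "Error" or (status == "Failed" and exitCode in [143])."""
    if status == "Error":
        return True
    # Assumption: a pod that never ran has no exit code, modelled here as None.
    return status == "Failed" and exit_code is not None and int(exit_code) in [143]


def should_retry(policy, status, exit_code):
    """Retry only when both the policy and the expression allow it."""
    return status in POLICY_STATUSES[policy] and expression_matches(status, exit_code)
```

With this model, `should_retry("Always", "Failed", None)` is false: the policy allows the retry, but the expression does not, because the exit code is not 143.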
from argo-workflows.
can this be treated as a transient error and autoretried by the controller?
You can set the env var `TRANSIENT_ERROR_PATTERN` to add additional patterns to be treated as transient.
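Roughly, the idea is that an error message counts as transient if it matches a built-in check or the user-supplied pattern. The sketch below only illustrates that idea and is not the controller's code: `BUILT_IN_MARKERS` and `is_transient` are hypothetical, the marker strings are placeholders, and Python's `re` is not the regex engine the controller uses.

```python
import os
import re

# Placeholder examples of built-in checks (hypothetical, not the real list).
BUILT_IN_MARKERS = ("connection reset by peer", "i/o timeout")


def is_transient(message, env=None):
    """Return True if `message` matches a built-in marker or the user pattern."""
    env = os.environ if env is None else env
    if any(marker in message for marker in BUILT_IN_MARKERS):
        return True
    # An unset or empty pattern adds nothing to the built-in checks.
    pattern = env.get("TRANSIENT_ERROR_PATTERN", "")
    return bool(pattern) and re.search(pattern, message) is not None
```

For example, `is_transient("some cluster-specific failure", {"TRANSIENT_ERROR_PATTERN": "cluster-specific"})` is true, while the same message with no pattern set is not treated as transient.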
Not sure if there would be built-in support for this per #12567 (review), cc @terrytangyuan. It is a node error, so maybe
from argo-workflows.
retries are enabled but this did not retry like i expected it would
from argo-workflows.
> retries are enabled but this did not retry like i expected it would
Ok, that's unexpected, I would have expected this to be retried.
from argo-workflows.
> What we have in this case is a successful call to create a pod in the cluster. The pod has been created, in that the kubernetes object is there. Then the cluster has either failed to make a running pod on a node, or has created it and then it has been evicted.
> What cannot be automated is the knowledge that recreating a pod is a safe thing to do, and we defer to the user to mark a pod recreation as safe by them saying a step can be retried.
Oh I didn't actually look up the error message (and have never seen it before myself, or at least not recently). This looks like it is indeed a type of eviction. Agreed in that case that this does not quite match a transient pattern then.
I was thinking this might have been an error message from a race with `node-problem-detector` or something and so the next try might get scheduled on a different node due to a taint, but if this is an eviction by the kubelet, then that is indeed not guaranteed as there is no taint added to the node.
> retries are enabled but this did not retry like i expected it would
@tooptoop4 what was your `retryStrategy` set to? There wasn't a Workflow attached to your issue report
from argo-workflows.
Was the node marked as failed? If not, might be fixed by #12197. This is potentially duplicative of #12231 in that case
from argo-workflows.
Have you tried setting `TRANSIENT_ERROR_PATTERN`?
from argo-workflows.
@terrytangyuan I mentioned that above already. I cc'ed you before as I thought you might have a decision regarding what should and shouldn't be built-in.
Per Alan's comments though, this actually wouldn't fit the criteria of a transient error for the Controller anyway, so I think we have decisive disqualification now. The env var could still potentially be used as a user-land workaround for this kind of thing though.
The `retryStrategy` not working in this case is potentially a bug though
from argo-workflows.
The node failed but the retry didn't run?
from argo-workflows.
correct, node failed but retry didn't run. i have
retryPolicy: "Always"
expression: 'lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) in [255])'
from argo-workflows.
If the node `Failed` then your expression requires the `exitCode` to be `255`. Can you verify that was the case?
from argo-workflows.
from what i can see there is no exit code, like the pod could never start. there are no logs for main/init/wait containers
from argo-workflows.
how can the expression cover a pod that did not start?
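One possible shape for such a condition, written as a Python model rather than as a real `retryStrategy` expression: treat a `Failed` node with no exit code as retryable, alongside the listed exit codes. This thread does not establish how a missing exit code is actually exposed to the expression, so the `None` below is an assumption and `widened_expression` is a hypothetical name.

```python
def widened_expression(status, exit_code, retryable_codes=(255,)):
    """Retry on Error, on Failed with a listed exit code, or on Failed with no exit code."""
    if status == "Error":
        return True
    if status != "Failed":
        return False
    # Assumption: a pod that never started reports no exit code (None here).
    if exit_code is None:
        return True
    return int(exit_code) in retryable_codes
```

Under this model `widened_expression("Failed", None)` is true, while a `Failed` node with an unlisted exit code such as `"1"` is still not retried.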
from argo-workflows.