Comments (17)

Joibel commented on June 3, 2024

I am unsure if this can be covered under transient error handling.

My understanding of this:
Transient error handling covers the case where we attempt a network call and it fails in a way that may succeed if we retry it later.
What we have in this case is a successful call to create a pod in the cluster. The pod has been created, in that the kubernetes object is there. Then the cluster has either failed to make a running pod on a node, or has created it and then it has been evicted.
What cannot be automated is the knowledge that recreating a pod is a safe thing to do, and we defer to the user to mark a pod recreation as safe by them saying a step can be retried.

In the specific case above it appears that the pod never started running (it has transitioned from state Pending), but we cannot actually know this, as the controller may have missed the event that said it started running.

@tooptoop4 did this node have retry on?

from argo-workflows.

tooptoop4 commented on June 3, 2024
      retryStrategy:
        limit: "2"
        retryPolicy: "Always"
        expression: 'lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) in [143])'

terrytangyuan commented on June 3, 2024

I don't think we should include this as built-in, which is why I am suggesting trying the env var.

tooptoop4 commented on June 3, 2024

node marked as failed

Joibel commented on June 3, 2024

In which case the controller is doing the right thing, isn't it?
It will evaluate your expression AND policy: Always, but your expression is false, so it won't retry.

agilgur5 commented on June 3, 2024

> can this be treated as a transient error and autoretried by the controller?

You can set the env var TRANSIENT_ERROR_PATTERN to add additional patterns to be treated as transient.
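
As a rough sketch of what that could look like, the variable is set on the workflow-controller container. The container name below assumes a default installation, and the value is only a placeholder, since the actual error message is not shown in this thread.

```yaml
# Fragment of the workflow-controller Deployment (names assume a default install).
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: TRANSIENT_ERROR_PATTERN
              # Placeholder: replace with a regular expression that matches
              # the error text you want treated as transient.
              value: "<regex matching the error message>"
```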

Not sure if there would be built-in support for this per #12567 (review), cc @terrytangyuan . It is a node error, so maybe

tooptoop4 commented on June 3, 2024

retries are enabled but this did not retry like i expected it would

Joibel commented on June 3, 2024

> retries are enabled but this did not retry like i expected it would

Ok, that's unexpected, I would have expected this to be retried.

agilgur5 commented on June 3, 2024

> What we have in this case is a successful call to create a pod in the cluster. The pod has been created, in that the kubernetes object is there. Then the cluster has either failed to make a running pod on a node, or has created it and then it has been evicted.
> What cannot be automated is the knowledge that recreating a pod is a safe thing to do, and we defer to the user to mark a pod recreation as safe by them saying a step can be retried.

Oh I didn't actually look up the error message (and have never seen it before myself, or at least not recently). This looks like it is indeed a type of eviction. Agreed in that case that this does not quite match a transient pattern then.

I was thinking this might have been an error message from a race with node-problem-detector or something, so that the next try might get scheduled on a different node due to a taint. But if this is an eviction by the kubelet, then that is indeed not guaranteed, as there is no taint added to the node.

> retries are enabled but this did not retry like i expected it would

@tooptoop4 what was your retryStrategy set to? There wasn't a Workflow attached to your issue report

agilgur5 commented on June 3, 2024

Was the node marked as failed? If not, might be fixed by #12197. This is potentially duplicative of #12231 in that case

terrytangyuan commented on June 3, 2024

Have you tried setting TRANSIENT_ERROR_PATTERN?

agilgur5 commented on June 3, 2024

@terrytangyuan I mentioned that above already. I cc'ed you before as I thought you might have a decision regarding what should and shouldn't be built-in.

Per Alan's comments though, this actually wouldn't fit the criteria of a transient error for the Controller anyway, so I think we have decisive disqualification now. The env var could still potentially be used as a user-land workaround for this kind of thing though.

The retryStrategy not working in this case is potentially a bug though

agilgur5 commented on June 3, 2024

The node failed but the retry didn't run?

tooptoop4 commented on June 3, 2024

correct, node failed but retry didn't run. i have

          retryPolicy: "Always"
          expression: 'lastRetry.status == "Error" or (lastRetry.status == "Failed" and asInt(lastRetry.exitCode) in [255])'

Joibel commented on June 3, 2024

If the node Failed then your expression requires the exitCode to be 255. Can you verify that was the case?

tooptoop4 commented on June 3, 2024

from what i can see there is no exit code, like the pod could never start. there are no logs for main/init/wait containers

tooptoop4 commented on June 3, 2024

how can the expression cover a pod that did not start?
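
One possible direction, going by Joibel's reading that a Failed node only matches this expression when its exit code is 255: drop the exit-code condition so that any Failed or Error node matches. This is a sketch, not a confirmed fix. It assumes that retrying every failure of this step is safe, and it has not been verified against a pod that never started.

```yaml
retryStrategy:
  limit: "2"
  retryPolicy: "Always"
  # Matches any Error or Failed node, whether or not an exit code was recorded.
  expression: 'lastRetry.status == "Error" or lastRetry.status == "Failed"'
```

The trade-off is that genuine application failures are retried as well, up to the limit.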
