What happened? We have encountered a situation when a Job controll

I found that it was introduced in 1.22: <a href="https://github.com/kubernetes/kuberne

/sig apps cc <a class="user-mention notranslate" data-hovercard-type="user" data-h

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

Job may get stuck repeatedly failing with Duplicate value message for uncountedTerminatedPods.failed about kubernetes HOT 26 CLOSED

mimowo commented on September 27, 2024

Job may get stuck repeatedly failing with Duplicate value message for uncountedTerminatedPods.failed

from kubernetes.

Comments (26)

SergeyKanzhelev commented on September 27, 2024 2

I found that it was introduced in 1.22: https://github.com/kubernetes/kubernetes/pull/102344/files#diff-67dd9e8f3ee257072765326cb4f242852554a2c0753563fa51e292c0a63a7b94L2112 It may be more visible because of some finalizers are cleaning up slower of something. But I think the bug is critical enough to justify the cherry-picks

from kubernetes.

KPostOffice commented on September 27, 2024 1

I looked into this some today and it looks like there was a check added in the last month here

From this PR by @SergeyKanzhelev . From what I understand of the problem, this fix should close this issue as well.

from kubernetes.

mimowo commented on September 27, 2024

/sig apps
cc @alculquicondor @atiratree

from kubernetes.

mimowo commented on September 27, 2024

FYI we had a similar issue in the past with handling pods which transitioned from being categorized as Failed to Succeeded: #111646. This was easier to reproduce as Job controller itself would handle pods with deletionTimestamp as Failed, but they could end up Succeeded eventually. The situation here seems to involve the opposite - the pod first is counted as Succeeded, then as Failed.

from kubernetes.

alculquicondor commented on September 27, 2024

@liggitt wdyt about making the job controller resilient to a flip of phase from Succeeded to Failed, as opposed to fixing the underlying problem (which we haven't been able to identify yet)?

Ideally, changes of the phase should have been prevented by validation, but I know that could be very disruptive.

cc @tallclair @bobbypage for any thoughts on how the kubelet could have made a decision about transition the Pod to Failed after it was declared as Succeeded.

from kubernetes.

liggitt commented on September 27, 2024

@smarterclayton I thought validation was added to prevent phase flips away from terminal phases

Things like this:

I suspect there are more actors than just job controller that would not handle transitions out of terminal phases correctly ... I don't love the idea of compensating for violation of an invariant in one controller

from kubernetes.

alculquicondor commented on September 27, 2024

Here we are talking about the Pod phase, not the container.

Regardless, in the status above we see:

  containerStatuses:
  - state:
      terminated:
        exitCode: 137
        finishedAt: null
        message: The container could not be located when the pod was terminated
        reason: ContainerStatusUnknown
        startedAt: null

So I am not seeing how the phase would have been Suceeded at any point.

@mimowo can you double check whether there are other conditions in which the job controller could consider a Pod succeeded even if it doesn't have the Succeeded phase?

Alternatively.... maybe the update didn't come from a kubelet? 🤔

from kubernetes.

alculquicondor commented on September 27, 2024

Quickly looking at the job controller code, it looks like the only way for the job controller to consider a Pod succeeded is Pod.Status.Phase == Succeeded.

So it seems that these Pods were indeed Succeeded at some point, but we couldn't validate via audit logs, as status updates are not recorded.

Additionally, in the affected Pods we saw:

  conditions:
  - message: 'Taint manager: deleting due to NoExecute taint'
    reason: DeletionByTaintManager
    status: "True"
    type: DisruptionTarget
...
  message: 'Pod was rejected: Predicate NodeAffinity failed'
  phase: Failed

which might suggest that this is happening during a Node drain or shutdown.

from kubernetes.

mimowo commented on September 27, 2024

@smarterclayton I thought validation was added to prevent phase flips away from terminal phases

Things like this:

api: validate container phase transitions #54530

kubelet: check for illegal container state transition #54597

I suspect there are more actors than just job controller that would not handle transitions out of terminal phases correctly ... I don't love the idea of compensating for violation of an invariant in one controller

IIUC this validation is for container statuses. We observed, based on audit logs, that pod's phase can transition under some circumstances from Succeeded to Failed. All transitions are done by Kubelet.

In out scenario (found a couple cases like that) the transitioning to Failed is always due to rejecting a pod, after Kubelet restart.

Also, it seems the validation prevents updating from terminal container status to running, but transition to another terminal (completely differnt) is still possible.

from kubernetes.

mimowo commented on September 27, 2024

/sig node
To help us triage, the fact that Kubelet can flip pods phase from Succeeded to Failed seems like a bug.
/cc @SergeyKanzhelev @bobbypage

from kubernetes.

SergeyKanzhelev commented on September 27, 2024

Is it 1.30 only? Or repro on earlier versions?

This indeed looks like a bug

from kubernetes.

tallclair commented on September 27, 2024

Some interesting additional context from a case we saw:

2024-05-23 21:37:42.048 - PATCH by Kubelet setting `ContainerCreating` [1]
2024-05-23 21:38:18.537 - PATCH by Kubelet setting the phase as Running [2]
2024-05-23 21:38:38.471 - PATCH by Kubelet setting the container exit code as 0 [3]
2024-05-23 21:41:01.002 - Kubelet restart [4]
2024-05-23 21:42:16.332 - PATCH by Kubelet setting phase as Succeeded
2024-05-23 21:42:17.394 - Job status update adding the succeeded pod - based on the timestamp in managed-fields, matched with logs [6]
2024-05-23 21:42:18.336 - finalizer removal by Job controller rejected by the validation.gatekeeper.sh webhook [7]
2024-05-23 21:47:59.045 - PATCH by Kubelet which sets the phase as Failed, reason OutOfPods [8]

I think a reason of OutOfPods can only be set in pod admission, called from HandlePodAdditions. That means somehow Kubelet is setting the pod phase to succeeded, but then still subsequently handling that pod as an addition.

from kubernetes.

mimowo commented on September 27, 2024

The saw at least one affected cluster by Kubelet flipping the phase in 1.27, 1.28 and 1.29.

from kubernetes.

tallclair commented on September 27, 2024

I think this comment is probably referring to the cause of this issue:

kubernetes/pkg/kubelet/kubelet.go

Lines 2437 to 2441 in a7ca13e

 // After restarting, kubelet will get all existing pods through 

 // ADD as if they are new pods. These pods will then go through the 

 // admission process and *may* be rejected. This can be resolved 

 // once we have checkpointing. 

 handler.HandlePodAdditions(u.Pods)

from kubernetes.

mimowo commented on September 27, 2024

IIUC it means in this code does not protect against it:

kubernetes/pkg/kubelet/kubelet_pods.go

Lines 1817 to 1825 in bb2068b

 // pods are not allowed to transition out of terminal phases 

 if pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded { 

 // API server shows terminal phase; transitions are not allowed 

 if s.Phase != pod.Status.Phase { 

 klog.ErrorS(nil, "Pod attempted illegal phase transition", "pod", klog.KObj(pod), "originalStatusPhase", pod.Status.Phase, "apiStatusPhase", s.Phase, "apiStatus", s) 

 // Force back to phase from the API server 

 s.Phase = pod.Status.Phase 

 } 

 }

. Maybe because after kubelet restart the API server pod status is not initialized yet?

from kubernetes.

SergeyKanzhelev commented on September 27, 2024

also from @smarterclayton:

the problem is status manager makes a blind write to the API on kubelet restart without checking / preconditioning on the API server having the expected phase. That's definitely a kubelet bug.

from kubernetes.

AnishShah commented on September 27, 2024

sig-node meeting notes:

/triage accepted
/priority important-longterm

from kubernetes.

alculquicondor commented on September 27, 2024

@SergeyKanzhelev @tallclair is there anyone working on this?

Otherwise we might need to look into a mitigation in the job controller side, like following suit and honoring the last update from kubelet.

from kubernetes.

mimowo commented on September 27, 2024

This is great news! @SergeyKanzhelev @smarterclayton does it match your expectation that the PR would prevent kubelet from flipping the phase from Succeeded to Failed?

In that case we could consider closing the issue, as all known cases fall into this category. Unless, we want to make Job controller resilient to potential race conditions of kubelet and other controllers that would want to set the opposite phase. Any opinions @alculquicondor @atiratree ?

from kubernetes.

SergeyKanzhelev commented on September 27, 2024

This is great news! @SergeyKanzhelev @smarterclayton does it match your expectation that the PR would prevent kubelet from flipping the phase from Succeeded to Failed?

Yes, but only in this specific case - if this was a re-admission failure

The saw at least one affected cluster by Kubelet flipping the phase in 1.27, 1.28 and 1.29.

I am trying to understand whether it regressed at some point as I want to cherry-pick the fix. Do you know about 1.26?

from kubernetes.

mimowo commented on September 27, 2024

Don't know definitely, but likely a regression, because I saw 10 clusters with stuck Jobs due to this issue, and all were 1.27+.

from kubernetes.

SergeyKanzhelev commented on September 27, 2024

/priority critical-urgent

from kubernetes.

alculquicondor commented on September 27, 2024

+1 to cherry-picking to all supported versions

from kubernetes.

mimowo commented on September 27, 2024

+1 for cherry-picking

from kubernetes.

alculquicondor commented on September 27, 2024

/close
cherry-picks are merged

from kubernetes.

k8s-ci-robot commented on September 27, 2024

@alculquicondor: Closing this issue.

In response to this:

/close
cherry-picks are merged

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

from kubernetes.

Job may get stuck repeatedly failing with Duplicate value message for uncountedTerminatedPods.failed about kubernetes HOT 26 CLOSED

Comments (26)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

	// After restarting, kubelet will get all existing pods through
	// ADD as if they are new pods. These pods will then go through the
	// admission process and may be rejected. This can be resolved
	// once we have checkpointing.
	handler.HandlePodAdditions(u.Pods)

	// pods are not allowed to transition out of terminal phases
	if pod.Status.Phase == v1.PodFailed \|\| pod.Status.Phase == v1.PodSucceeded {
	// API server shows terminal phase; transitions are not allowed
	if s.Phase != pod.Status.Phase {
	klog.ErrorS(nil, "Pod attempted illegal phase transition", "pod", klog.KObj(pod), "originalStatusPhase", pod.Status.Phase, "apiStatusPhase", s.Phase, "apiStatus", s)
	// Force back to phase from the API server
	s.Phase = pod.Status.Phase
	}
	}