Comments (31)

christianh814 commented on August 16, 2024

@nickjj Are you using a Pre or Post sync hook for the CronJob or is it part of the normal Sync?

nickjj commented on August 16, 2024

It's part of a normal sync.

nickjj commented on August 16, 2024

It may be worth pointing out that the config map has a sync-wave value of -30. I wanted to make sure the config map gets updated before everything else, since other resources (jobs, deployments, etc.) depend on it. After that, SealedSecrets has -20 and everything else has the default value.
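For reference, the wave is set with Argo CD's sync-wave annotation; a minimal sketch, shown here on a plain ConfigMap for illustration (data hypothetical):

apiVersion: v1
kind: ConfigMap
metadata:
  name: "hello-world-app"
  annotations:
    argocd.argoproj.io/sync-wave: "-30"
data:
  EXAMPLE_KEY: "example"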

crenshaw-dev commented on August 16, 2024

The cron job spawns a new job with myapp-abc123 and that doesn't exist anymore so we get the error

How is it possible for the CronJob to still reference myapp-abc123 instead of the new myapp-xyz123? Is there a large enough time gap between the sync waves that a new Job gets triggered from the not-yet-synced CronJob?

nickjj commented on August 16, 2024

How is it possible for the CronJob to still reference myapp-abc123 instead of the new myapp-xyz123?

I'm not sure how it's possible. Maybe it has to do with how it's scheduled? We have around 12 cron jobs, and this one is the only one where this happens, but it's configured the same way as the others. The only difference is that this cron job happens to run for ~15 minutes and runs every 30 minutes.

The time gap is quite small. It usually takes 1-3 seconds to progress to the point where new pods based on deployments and jobs are coming up.

crenshaw-dev commented on August 16, 2024

How does the CronJob reference the ConfigMap?

nickjj commented on August 16, 2024

In the same way as the deployment and other jobs:

envFrom:
- configMapRef:
    name: "hello-world-app"

In the kustomization.yaml for each app I use this:

configMapGenerator:
- behavior: "merge"
  envs:
  - ".env"
  name: "hello-world-app"

In this case the .env file is just a newline-separated list of environment variables.
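For example, the .env contents might look like this (keys and values hypothetical):

API_HOST=api.example.com
LOG_LEVEL=info
WORKER_CONCURRENCY=4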

crenshaw-dev commented on August 16, 2024

So Kustomize is responsible for making sure the generated names match.

I think we need a clearer picture of the exact scenario that's producing the mismatch between the Job spec and the existing ConfigMap. Argo CD's responsibility ends at applying the specified manifests. The conditions which produce the mismatch are what we need to understand and resolve.

nickjj commented on August 16, 2024

Beyond the bullet points describing the workflow that causes this, is there anything I can add to this issue?

nickjj commented on August 16, 2024

Here are a few screenshots in the order things happened recently. Sorry for the small font size on some of them; let's just say Outlook and OneDrive are bad and resized them when I emailed them from work to my personal machine. Fortunately the text is still readable (barely) if you zoom in.

But this demonstrates that the CronJob references the new config map name while both the Job and the Pod it spawned do not.

The new deployment has a new config map suffix starting with -gcd, whereas the old deployment's was -962.

The only things I've partially redacted are the names of certain resources, since they would show my employer's company name and some of what we're doing based on the cron job name.

  • ConfigMap: new (00-config-map-new)
  • Pod: config map error (01-pod-config-map-error)
  • Pod: summary error (02-pod-summary-error)
  • Pod: events error (03-pod-events-error)
  • CronJob: manifest (04-cronjob-manifest)
  • Job: manifest (05-job-manifest)
  • Job: stuck progressing due to failing pod (06-job-stuck-progressing-due-to-failing-pod)
  • Pod: manifest (07-pod-manifest)

crenshaw-dev commented on August 16, 2024

The system as designed is stateful. The ConfigMap, because of the randomized name, is effectively ephemeral. So the Job (which may persist after the ephemeral ConfigMap is deleted) depends on state which may change.

You'll need to either make the Job not depend on the ephemeral resource, or build something to clean up the problematic Jobs.

Is it actually necessary to use a generated name for the ConfigMap? Could the name instead be static?

nickjj commented on August 16, 2024

The job requires the config map because it spawns a pod with environment variables that are necessary for the workload to be completed.

The generated config map name was helpful for letting Kubernetes know its values had changed. For example, if we changed an environment variable and pushed that, Argo CD picked up the change and any new resources referencing the config map were automatically re-deployed. If we use a static name, from what I remember, that re-deployment doesn't happen automatically because Kubernetes doesn't know the contents changed.

Any suggestions on an event-based, automated way to auto-delete failing jobs?
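For reference, Kubernetes can garbage-collect finished Jobs, including failed ones, via the Job's ttlSecondsAfterFinished field, though it only fires once the Job reaches a terminal state. A minimal sketch, with a hypothetical name and image:

apiVersion: batch/v1
kind: Job
metadata:
  name: "hello-world-job"  # hypothetical
spec:
  ttlSecondsAfterFinished: 300  # delete the Job and its Pods 5 minutes after it completes or fails
  backoffLimit: 3
  template:
    spec:
      restartPolicy: "OnFailure"
      containers:
      - name: "main"
        image: "curlimages/curl"  # hypothetical image
        command: ["curl", "--silent", "https://example.com"]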

nickjj commented on August 16, 2024

#13642 was a suggestion to allow static config names with automatic rollouts, but it was closed with a recommendation to use a third-party tool. This feels like a pretty common use case: a cron job that references a config map. Any possibility of this issue being investigated again, given that some of the suggestions in that thread say to use Kustomize's config map generator, which doesn't work, as demonstrated here?

I actually use the checksum approach for secrets because we have a custom CLI tool for interacting with the secrets which updates the checksum values as needed, but the ConfigMap story is much different because it's just a plain-text file on disk that lives in the repo. Ideally you'd update that file, commit, and push it. Having a developer manually calculate and update a dozen checksum annotations wouldn't be a good outcome, IMO.
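For context, the checksum approach is the common pattern of hashing the config and embedding the hash as a pod-template annotation, so the workload rolls whenever the hash changes. A sketch, with a hypothetical annotation key and placeholder value:

spec:
  template:
    metadata:
      annotations:
        # hypothetical key; the value is a hash of the config file's contents
        checksum/config: "<sha256-of-.env-contents>"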

crenshaw-dev commented on August 16, 2024

Do you need already-running Jobs to pick up the new config, or just new Jobs spawned by the CronJob after syncing?

If the latter, you should be able to use a static ConfigMap name. New Jobs will pick up the newly-synced ConfigMap as soon as they're spawned.
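A minimal sketch of the static-name variant, assuming Kustomize's per-generator options field and the same generator as above:

configMapGenerator:
- behavior: "merge"
  envs:
  - ".env"
  name: "hello-world-app"
  options:
    # keep the name stable by skipping the content-hash suffix
    disableNameSuffixHash: true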

Handling config updates for Deployments as discussed in #13642 is a significantly different problem.

nickjj commented on August 16, 2024

The same config map used by the cron jobs is used by our main deployment which is a web app. In our case it's kind of related.

I suppose the outcome I'm looking for is for this to work in a way where it doesn't get itself into a failure state. If a cron job triggered a job, which triggered a pod that's currently running, I'm not sure why deploying a new version of our app would cause that pod to fail with the config map error when it's already running with the old one.

In theory shouldn't the pod that's already running have everything it needs to continue running even if the old config map got pruned by Argo CD?

crenshaw-dev commented on August 16, 2024

The same config map used by the cron jobs is used by our main deployment which is a web app.

I'd recommend using something like Reloader for the Deployment. Or you could split the config into two ConfigMaps: one with name randomization to trigger the Deployment restart, and the other without name randomization for the CronJob to use.
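A sketch of the Reloader route, assuming the stakater Reloader annotation convention (names and image hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: "hello-world-app"  # hypothetical
  annotations:
    # Reloader watches the ConfigMaps/Secrets this Deployment references
    # and triggers a rolling restart whenever one of them changes
    reloader.stakater.com/auto: "true"
spec:
  selector:
    matchLabels:
      app: "hello-world-app"
  template:
    metadata:
      labels:
        app: "hello-world-app"
    spec:
      containers:
      - name: "web"
        image: "nginx"  # hypothetical image
        envFrom:
        - configMapRef:
            name: "hello-world-app"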

Right now, you have two resources with very different runtime behavior depending on the same ConfigMap. In one case, an ephemeral ConfigMap is fine (and even preferable if you want a quick-and-easy restart mechanism), and in the other case, an ephemeral ConfigMap is unacceptable because stateful resources (Jobs) depend on the ConfigMap persisting.

I'm not sure why deploying a new version of our app will cause the pod to fail with the config map error when it's already running with the old one.

In theory shouldn't the pod that's already running have everything it needs to continue running even if the old config map got pruned by Argo CD?

Because the Job doesn't load the ConfigMap; the Pod does. If the Job needs to retry for some reason, it will spawn a new Pod, and the Pod will not be able to load the now-deleted ephemeral ConfigMap.

You'd hit the same problem if a new Pod belonging to an old ReplicaSet depended on an ephemeral ConfigMap. Imagine you deploy a new image, and the Deployment spins up a new ReplicaSet which fails with an ImagePullBackOff. If a Pod belonging to the old ReplicaSet were evicted, a new one would be spun up, and the new Pod would fail to load the now-deleted ephemeral ConfigMap.

Randomized ConfigMap names are just a bad way to force workload restarts. Other tools exist because a more elegant solution is actually necessary.

nickjj commented on August 16, 2024

If the Job needs to retry for some reason, it will spawn a new Pod, and the Pod will not be able to load the now-deleted ephemeral ConfigMap.

That's the interesting part. The job didn't need to retry, it was happily running before the new deployment.

I'd recommend using something like Reloader for the Deployment.

Thanks. I'll check into this and only have it enabled for the deployment.

crenshaw-dev commented on August 16, 2024

The job didn't need to retry, it was happily running before the new deployment.

That is weird. The Pod events don't make sense to me. Why was the image pulled so many times? And why did it take so long from the initial pull attempt until the first successful pull? Was the referenced ConfigMap deleted in that ~10-minute window?

nickjj commented on August 16, 2024

It's hard to say why it was pulled that many times. Maybe each time K8s retried, it re-pulled? Right now a job is running successfully and the pod shows 1 pull with 1 container start. This job pretty much never fails; it runs a lot to pick up new work.

The gap between the initial pull and the successful pull is likely due to it being stuck for a while. After I took all of the screenshots I manually deleted the job, which fixed everything. I didn't manually delete the ConfigMap; Argo CD must have pruned it after the app was synced successfully.

crenshaw-dev commented on August 16, 2024

It's also odd that there are more container-failure events than ConfigMap-error events. I wonder if there was an initial failure after the image was finally pulled, and whether the container restart triggered the ConfigMap reload, which failed because the ConfigMap had been pruned.

It's difficult to reconstruct the events. But either way, it's dangerous to leave dangling references to an ephemeral resource.

nickjj commented on August 16, 2024

Is there anything at the K8s level that would trigger the job/pod to restart when a new deployment went out, even if it didn't fail?

crenshaw-dev commented on August 16, 2024

Nothing I can think of off-hand... but glancing at the Job spec in the screenshot, I realize it's got a lot of functionality I've never used.

crenshaw-dev commented on August 16, 2024

I do think the 10-minute gap between first pull and first run is plenty of time to trigger the race condition, even without a container restart.

nickjj commented on August 16, 2024

Most of the options are the defaults. I've set concurrencyPolicy: "Forbid" on the CronJob, and the Job only has a backoffLimit of 3 and restarts on failure; the rest are defaults.
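In manifest form, that configuration looks roughly like this (a sketch; name and image hypothetical, schedule taken from the description above):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: "hello-world-cron"  # hypothetical
spec:
  schedule: "*/30 * * * *"  # runs every 30 minutes
  concurrencyPolicy: "Forbid"
  jobTemplate:
    spec:
      backoffLimit: 3
      template:
        spec:
          restartPolicy: "OnFailure"
          containers:
          - name: "main"
            image: "curlimages/curl"  # hypothetical image
            envFrom:
            - configMapRef:
                name: "hello-world-app"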

We've never had this fail outside of a deployment in roughly two years.

I'm not sure what would cause the 10-minute gap between first pull and first run. The image being pulled is the official-ish curl image. We have other cron jobs using the same image, and they finish in a few seconds end to end.

crenshaw-dev commented on August 16, 2024

With the 3 retries configured, it's possible the container fails sometimes but has never exceeded the 3-retry threshold.

A failed run and retry isn't the only thing that can introduce enough time to trigger the race condition; the slow pull is another. Basically, anything that adds time between "reference broken by resource prune" and "failed use of dangling reference" can trigger it.

I'm going to go ahead and close the issue since Argo CD is behaving correctly. Eliminating the possibility of dangling references is outside the scope of Argo's purpose, which is to apply declarative config. Occasional failures are an inherent part of a declarative, eventually-consistent system. If the failures caused by dangling references to the ephemeral ConfigMap are unacceptable, then you'll need to rework how that config is referenced to make the dangling reference less likely.

Happy to keep brainstorming here, just closing for bookkeeping purposes.

nickjj commented on August 16, 2024

Thanks. I may just configure Argo CD not to prune the config maps and keep a mental note to clean them up every quarter. That would at least avoid running another tool (Reloader) in our cluster and being responsible for its configuration and updates.
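Opting a resource out of pruning is done with Argo CD's sync-options annotation; a minimal sketch, applied to the generated ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: "hello-world-app"
  annotations:
    argocd.argoproj.io/sync-options: Prune=false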

crenshaw-dev commented on August 16, 2024

Yeah, that's probably easiest. And I bet there are generic TTL controllers that can clean them up for you and spare you the mental note.

nickjj commented on August 16, 2024

You mean something like https://github.com/TwiN/k8s-ttl-controller? Looks like one more thing to install and configure, though.
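If I'm reading that project's README right, usage is a single annotation on the resource; a sketch, with a hypothetical name and TTL value:

apiVersion: v1
kind: ConfigMap
metadata:
  name: "hello-world-app-old"  # hypothetical stale generated ConfigMap
  annotations:
    # k8s-ttl-controller deletes the resource once the TTL has elapsed
    k8s-ttl-controller.twin.sh/ttl: "12h"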

I wonder how much memory would be used if ~50-75 old config maps were sitting around. Most of them have about 10 KB of data. Do you know offhand if they all sit in memory somewhere?

crenshaw-dev commented on August 16, 2024

Yep, exactly like that. Indeed, another thing to maintain.

They'll all sit around in etcd. 50 ConfigMaps should be no big deal. The most annoying part will be your app being forever OutOfSync.

nickjj commented on August 16, 2024

Hmm, I didn't know it would be forever OutOfSync.

It looks like there's also #1636, which has been open for almost 5 years.

crenshaw-dev commented on August 16, 2024

Ah that's a good idea.

Not sure I love #1636; we'd have to be in the business of understanding every way one resource can reference another. Seems like a lot of work for low value, and fairly risky.
