Comments (9)
Do you happen to have a more complete set of Karpenter controller logs from the time when this happened?
from karpenter-provider-aws.
Not much info, except for the messages that the node got deleted:
22:18:57.334 disrupting via consolidation delete, terminating 1 nodes (1 pods) ip-10-11-130-134.ec2.internal/m5a.2xlarge/on-demand
...
03:49:10.814 deleted node
From the logs it seems like the pod that was orphaned is prod/hdr-service-app-c9cdb8dbf-w2hr2. Just wanted to confirm that the pod spec you shared is the same for this pod, since the deployment is called test in that.
Yeah, sorry. The attached log was from the original issue with a production service, but basically prod/hdr-service-app-c9cdb8dbf-w2hr2 had the same issue.
Here are the logs from the reproduced issue with the test deployment:
E0622 04:08:15.200303 11 gc_controller.go:154] failed to get node ip-10-11-57-209.ec2.internal : node "ip-10-11-57-209.ec2.internal" not found
I0622 04:09:15.225784 11 gc_controller.go:246] "Found orphaned Pod assigned to the Node, deleting." pod="kube-system/aws-node-d47hp" node="ip-10-11-57-209.ec2.internal"
I0622 04:09:15.281401 11 gc_controller.go:246] "Found orphaned Pod assigned to the Node, deleting." pod="test/test-5ccdb7cd7f-dm9bq" node="ip-10-11-57-209.ec2.internal"
I0622 04:09:15.303497 11 gc_controller.go:246] "Found orphaned Pod assigned to the Node, deleting." pod="kube-system/ebs-csi-node-xt68r" node="ip-10-11-57-209.ec2.internal"
I0622 04:08:07.894293 10 node_tree.go:79] "Removed node in listed group from NodeTree" node="ip-10-11-57-209.ec2.internal" zone="us-east-1:\x00:us-east-1a"
--
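For anyone triaging a similar report, the orphaned pods can be pulled out of log lines like the ones above with a short script. This is only a minimal sketch: the regular expression is written against the message format seen in this excerpt, the function name `orphaned_pods` is made up for illustration, and other Kubernetes versions may word the message differently.

```python
import re

# Matches the garbage collector message exactly as it appears in the excerpt above.
# Assumption: the pod and node values are always double-quoted key="value" pairs.
ORPHAN_RE = re.compile(
    r'"Found orphaned Pod assigned to the Node, deleting\." '
    r'pod="(?P<pod>[^"]+)" node="(?P<node>[^"]+)"'
)


def orphaned_pods(lines):
    """Return (node, namespace/pod) pairs for every orphaned-pod deletion found."""
    found = []
    for line in lines:
        match = ORPHAN_RE.search(line)
        if match:
            found.append((match.group("node"), match.group("pod")))
    return found


# Sample input copied from the log excerpt in this thread.
sample = [
    'E0622 04:08:15.200303 11 gc_controller.go:154] failed to get node ip-10-11-57-209.ec2.internal : node "ip-10-11-57-209.ec2.internal" not found',
    'I0622 04:09:15.225784 11 gc_controller.go:246] "Found orphaned Pod assigned to the Node, deleting." pod="kube-system/aws-node-d47hp" node="ip-10-11-57-209.ec2.internal"',
    'I0622 04:09:15.281401 11 gc_controller.go:246] "Found orphaned Pod assigned to the Node, deleting." pod="test/test-5ccdb7cd7f-dm9bq" node="ip-10-11-57-209.ec2.internal"',
    'I0622 04:09:15.303497 11 gc_controller.go:246] "Found orphaned Pod assigned to the Node, deleting." pod="kube-system/ebs-csi-node-xt68r" node="ip-10-11-57-209.ec2.internal"',
]

for node, pod in orphaned_pods(sample):
    print(f"{node}: {pod}")
```

On the sample, this lists the two kube-system daemon pods and test/test-5ccdb7cd7f-dm9bq, all on ip-10-11-57-209.ec2.internal.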
terminationGracePeriodSeconds: 43200 #6hrs , 43200 - 12hrs
The deployment you shared has this. Is there a reason that this comment says 6 hours? Was terminationGracePeriod set to 6 hours or 12?
terminationGracePeriodSeconds is set to 12hrs (43200 seconds). We need it because on rare occasions the pod can't be interrupted for up to 12 hours.
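As a rough sanity check, the gap between the two Karpenter log lines quoted earlier (22:18:57.334 "disrupting via consolidation delete" and 03:49:10.814 "deleted node") can be compared with the 43200-second grace period. The sketch below assumes both timestamps are in the same time zone and that the second one falls on the following day, since the log excerpt omits dates. It also assumes the production pod used the same grace period as the test deployment, which the thread does not explicitly confirm. The helper name `elapsed` is made up for illustration.

```python
from datetime import datetime, timedelta

GRACE_PERIOD_SECONDS = 43200  # terminationGracePeriodSeconds quoted in the thread (12 hours)


def elapsed(start, end):
    """Time between two HH:MM:SS.mmm stamps, rolling over midnight if needed."""
    fmt = "%H:%M:%S.%f"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        # Assumption: the end stamp belongs to the next calendar day.
        t1 += timedelta(days=1)
    return t1 - t0


gap = elapsed("22:18:57.334", "03:49:10.814")
print(gap)                                         # 5:30:13.480000
print(gap.total_seconds() < GRACE_PERIOD_SECONDS)  # True
```

Under those assumptions the node was deleted about 5.5 hours after the disruption started, well short of the 12-hour grace period.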
I believe at this point it would make sense to go over the cluster audit logs to see if there's something that indicates what went wrong. Do you mind opening a support ticket to facilitate this?
sorry, what kind of support ticket do you mean? we don't have AWS premium support. I believe that behavior can be reproduced on any cluster.
I tried to reproduce this issue with the config that you shared but couldn't. Karpenter didn't remove the node until terminationGracePeriod was hit. At this point, I think it would make sense to have a look at a more complete set of Karpenter controller logs from the time when this happened.