Comments (13)
I've managed to get some logs 🥳 I took snapshots from the EBS volumes of the broken instance, recreated volumes from them, attached them to a new instance, and we're off to the races. 👍
The systemd unit that failed isn't very helpful log-wise:
[ec2-user@ip-10-0-155-70 vol1]$ sudo journalctl --root /vol1/ -u systemd-networkd-wait-online.service
Jan 23 10:47:53 localhost systemd[1]: Starting Wait for Network to be Configured...
Jan 23 10:49:53 localhost systemd-networkd-wait-online[2582]: Timeout occurred while waiting for network connectivity.
Jan 23 10:49:53 localhost systemd[1]: systemd-networkd-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Jan 23 10:49:53 localhost systemd[1]: systemd-networkd-wait-online.service: Failed with result 'exit-code'.
Jan 23 10:49:53 localhost systemd[1]: Failed to start Wait for Network to be Configured.
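For what it's worth, the gap between the "Starting" line and the "Timeout occurred" line can be checked with a few lines of Python. This is only a sketch: the `parse_journal_time` helper is made up for illustration, and the year is an assumption, since journalctl's short output format omits it. The gap comes out at 120 seconds, which matches the default timeout documented for systemd-networkd-wait-online, so the unit most likely sat through its full default wait rather than failing early.

```python
from datetime import datetime

# Two lines copied from the unit log above.
STARTED = "Jan 23 10:47:53 localhost systemd[1]: Starting Wait for Network to be Configured..."
TIMED_OUT = ("Jan 23 10:49:53 localhost systemd-networkd-wait-online[2582]: "
             "Timeout occurred while waiting for network connectivity.")


def parse_journal_time(line, year=2024):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a journalctl short-format line.

    The year is not present in the log line; 2024 is assumed from the thread.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


elapsed = parse_journal_time(TIMED_OUT) - parse_journal_time(STARTED)
print(f"wait-online ran for {elapsed.total_seconds():.0f} seconds before giving up")
```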
I'm attaching the full journalctl output from boot: journalctl.log
from bottlerocket.
Hi @yeazelm, thanks for having a look! :)
In the meantime, can you confirm how many instances you have seen this happen on and when it first happened?
Sure! This has happened 7 times in the past 24hrs:
{"level":"DEBUG","time":"2024-01-23T11:02:44.635Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-qshsb","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T10:46:45.896Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-6n8nn","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T09:26:41.677Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-4k6tm","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T04:36:35.005Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-rsdc7","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T02:15:55.465Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-bbnpz","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T00:28:45.271Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-bfllb","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-22T21:38:18.128Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-jzzzh","nodepool":"trino","ttl":"15m0s"}
In the same timespan we provisioned 1028 machines.
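To put a number on that, here is a rough sketch of how the count and the failure rate could be pulled out of the Karpenter logs. The helper names (`registration_timeouts`, `failure_rate`) are invented for this example, and it assumes the logs are available as one JSON object per line, as in the excerpt above. Only two of the seven lines are reproduced to keep it short, plus one made-up non-JSON line to show that stray output is skipped.

```python
import json

# Two of the seven Karpenter lines quoted above, plus one made-up non-JSON line.
LOG_LINES = [
    '{"level":"DEBUG","time":"2024-01-23T11:02:44.635Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-qshsb","nodepool":"trino","ttl":"15m0s"}',
    "some non-JSON line that happens to be in the log stream",
    '{"level":"DEBUG","time":"2024-01-23T10:46:45.896Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-6n8nn","nodepool":"trino","ttl":"15m0s"}',
]


def registration_timeouts(lines, message="terminating due to registration ttl"):
    """Return the nodeclaim names of records whose message matches exactly."""
    names = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON log record
        if isinstance(record, dict) and record.get("message") == message:
            names.append(record.get("nodeclaim"))
    return names


def failure_rate(failures, provisioned):
    """Format failures / provisioned as a percentage with two decimals."""
    return f"{failures / provisioned:.2%}"


timed_out = registration_timeouts(LOG_LINES)
print(timed_out)
# Figures from the comment above: 7 failures out of 1028 provisioned machines.
print(failure_rate(7, 1028))
```

With the figures from this comment, that works out to roughly 0.68% of provisioned machines over the 24 hours.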
Did you ever see this type of failure on previous Bottlerocket releases?
Looking back, similar errors have happened in the past at much lower rates and we didn't notice them. I can't confirm whether this was the exact same issue, but we've had the same "terminating due to registration ttl" Karpenter message:
- Once on December 29th
- Once on December 12th
- 4 times on December 2nd
That's it for the Oct 23rd - Jan 23rd period (we keep 90 days of logs).
@shaharitzko we're running in eu-west-1 (Ireland); this specific instance was in euw1-az3.
Hi @yeazelm, I just opened a ticket with the machine IDs from yesterday. Hopefully this will shed some light on the issue!
Just a quick update. I heard back about the internal investigation: they are still investigating. I wanted to confirm with @Pluies whether you are still experiencing this, or whether it stopped after Jan 26th?
Update: the Instance Connect endpoint didn't work out.
Also, I didn't mention it in the original report, but in case it helps: we're provisioning m7g.8xlarge instances.
Hello @Pluies, thanks for cutting this issue! These failures are interesting and we are going to take a look. Thanks for attaching the journalctl.log, which has some useful errors to dive into:
Jan 23 10:50:01 localhost kernel: ena 0000:00:05.0 eth0: TX hasn't completed, qid 2, index 13. 127850 msecs since last interrupt, 127850 msecs since last napi execution, napi scheduled: 0
We'll update the issue when we know more. In the meantime, can you confirm how many instances you have seen this happen on and when it first happened? Did you ever see this type of failure on previous Bottlerocket releases?
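As a side note on the ENA line quoted above, the fields can be extracted with a small regular expression, and the "msecs since last interrupt" value can be subtracted from the log timestamp to estimate when the last TX interrupt was seen. This is a sketch only: the `parse_ena_tx_stall` helper is hypothetical, the pattern is written against this one line rather than against the driver's source, and the year is assumed because the log line does not carry one. It is arithmetic on the quoted line, not a diagnosis.

```python
import re
from datetime import datetime, timedelta

# The kernel line quoted above, verbatim.
ENA_LINE = (
    "Jan 23 10:50:01 localhost kernel: ena 0000:00:05.0 eth0: TX hasn't completed, "
    "qid 2, index 13. 127850 msecs since last interrupt, "
    "127850 msecs since last napi execution, napi scheduled: 0"
)

# Pattern derived from this single line (an assumption, not the driver's format spec).
PATTERN = re.compile(
    r"TX hasn't completed, qid (?P<qid>\d+), index (?P<index>\d+)\. "
    r"(?P<since_irq>\d+) msecs since last interrupt, "
    r"(?P<since_napi>\d+) msecs since last napi execution"
)


def parse_ena_tx_stall(line, year=2024):
    """Return the numeric fields of a TX stall line, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = {key: int(value) for key, value in match.groupdict().items()}
    stamp = " ".join(line.split()[:3])
    logged_at = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    # Estimate only: the log timestamp has one-second resolution.
    fields["last_interrupt_at"] = logged_at - timedelta(milliseconds=fields["since_irq"])
    return fields


stall = parse_ena_tx_stall(ENA_LINE)
print(stall)
```

For this line the estimate lands at about 10:47:53, which is around the time the systemd-networkd-wait-online unit logged that it was starting.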
Thanks for the quick update @Pluies! Knowing that frequency is super helpful in trying to find what might be the cause.
I'm not sure whether this is just a result of the above or a contributing factor, but we're seeing the following error message in the AWS dashboard:
Instance status checks
Instance reachability check failed
Check failure at 2024/01/23 11:50 GMT+1 (1 day)
I'm now going to kill the node above (it's not cheap 😅), which should be fine given we have the volumes + logs.
The issue hasn't reappeared for the past 24h.
@Pluies can you please tell me the region you are using?
@Pluies Could you cut a ticket to AWS support that includes the instance IDs of the failures you know about and refers to this issue? I think that might help us dig in more.
Update from AWS:
Just as a precaution I have opened an internal investigation into the Bottleneck AMI related network error with the EKS service team and they are looking into a possible kernel bug. Please let me know if you see any recurrences of the same problem.
FWIW, we've not seen this happen again since Jan 26th. I'll update this thread with the results of the ticket above. 👍
The issue happens much less often, but we have seen it again on:
- 2024-02-09T00:38:09Z
- 2024-02-18T23:04:09Z
- 2024-02-25T09:39:13Z
- 2024-02-25T13:53:08Z