Comments (13)

Pluies commented on July 26, 2024

I've managed to get some logs 🥳 I took snapshots from the EBS volumes of the broken instance, recreated volumes from them, attached them to a new instance, and we're off to the races. 👍

The systemd unit that failed isn't very helpful log-wise:

[ec2-user@ip-10-0-155-70 vol1]$ sudo journalctl --root /vol1/ -u systemd-networkd-wait-online.service
Jan 23 10:47:53 localhost systemd[1]: Starting Wait for Network to be Configured...
Jan 23 10:49:53 localhost systemd-networkd-wait-online[2582]: Timeout occurred while waiting for network connectivity.
Jan 23 10:49:53 localhost systemd[1]: systemd-networkd-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Jan 23 10:49:53 localhost systemd[1]: systemd-networkd-wait-online.service: Failed with result 'exit-code'.
Jan 23 10:49:53 localhost systemd[1]: Failed to start Wait for Network to be Configured.

I'm attaching the full journalctl output from boot:
journalctl.log

from bottlerocket.

Pluies commented on July 26, 2024

Hi @yeazelm , thanks for having a look! :)

> In the meantime, can you confirm how many instances you have seen this happen on and when it first happened?

Sure! This has happened 7 times in the past 24hrs:

{"level":"DEBUG","time":"2024-01-23T11:02:44.635Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-qshsb","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T10:46:45.896Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-6n8nn","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T09:26:41.677Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-4k6tm","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T04:36:35.005Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-rsdc7","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T02:15:55.465Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-bbnpz","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T00:28:45.271Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-bfllb","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-22T21:38:18.128Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-jzzzh","nodepool":"trino","ttl":"15m0s"}

In the same timespan we provisioned 1028 machines.
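
For scale, 7 failures out of 1028 provisioned machines is a failure rate of roughly 0.7%. A minimal sketch of how such a count could be pulled from the Karpenter controller logs, assuming one JSON object per line as in the entries quoted above. The function names `registration_ttl_nodeclaims` and `failure_rate` are hypothetical and only for illustration; the sample entries are trimmed to the two fields used here.

```python
import json

TTL_MESSAGE = "terminating due to registration ttl"


def registration_ttl_nodeclaims(log_lines):
    """Return the nodeclaim names from log entries that match TTL_MESSAGE."""
    nodeclaims = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON log line
        if isinstance(entry, dict) and entry.get("message") == TTL_MESSAGE:
            nodeclaims.append(entry.get("nodeclaim"))
    return nodeclaims


def failure_rate(failures, provisioned):
    """Fraction of provisioned machines that never registered."""
    return failures / provisioned


# Two of the seven entries quoted above, trimmed to the fields used here.
sample = [
    '{"message":"terminating due to registration ttl","nodeclaim":"trino-qshsb"}',
    '{"message":"terminating due to registration ttl","nodeclaim":"trino-6n8nn"}',
    "not a json line",
]
print(registration_ttl_nodeclaims(sample))
print(f"{failure_rate(7, 1028):.2%}")
```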

> Did you ever see this type of failure on previous Bottlerocket releases?

Looking back, similar errors have happened in the past at much lower rates, and we didn't notice them. I can't confirm whether this was the exact same issue, but we've had the same "terminating due to registration ttl" Karpenter message:

  • Once on December 29th
  • Once on December 12th
  • 4 times on December 2nd

That's it for the Oct 23rd - Jan 23rd period (we keep 90 days of logs).

Pluies commented on July 26, 2024

@shaharitzko we're running in eu-west-1 (Ireland), this specific instance was in euw1-az3

Pluies commented on July 26, 2024

Hi @yeazelm , I just opened a ticket with the machine IDs from yesterday, hopefully this will shed some light on the issue!

yeazelm commented on July 26, 2024

Just a quick update: I heard from the internal investigation that they are still investigating, but I wanted to confirm with @Pluies whether you are still experiencing this or whether it stopped after Jan 26th.

Pluies commented on July 26, 2024

Update: the Instance Connect endpoint didn't work out.

Also, I didn't mention it in the original report, but in case it helps: we're provisioning m7g.8xlarge instances.

yeazelm commented on July 26, 2024

Hello @Pluies, thanks for cutting this issue! These failures are interesting and we are going to take a look. Thanks for attaching the journalctl.log which has some useful errors to dive into:

Jan 23 10:50:01 localhost kernel: ena 0000:00:05.0 eth0: TX hasn't completed, qid 2, index 13. 127850 msecs since last interrupt, 127850 msecs since last napi execution, napi scheduled: 0
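
For anyone scanning a journal dump for the same symptom, here is a minimal sketch that extracts the fields from an ENA "TX hasn't completed" line. The regular expression is written against the single line quoted above and is an assumption about the format, not something taken from the driver source; `parse_ena_tx_stall` is a hypothetical helper name.

```python
import re

# Pattern derived from the one kernel line quoted above; other driver
# versions may word this message differently.
ENA_TX_STALL = re.compile(
    r"ena (?P<pci>\S+) (?P<iface>[^\s:]+): TX hasn't completed, "
    r"qid (?P<qid>\d+), index (?P<index>\d+)\. "
    r"(?P<ms_since_interrupt>\d+) msecs since last interrupt, "
    r"(?P<ms_since_napi>\d+) msecs since last napi execution, "
    r"napi scheduled: (?P<napi_scheduled>\d+)"
)


def parse_ena_tx_stall(line):
    """Return the stall fields as a dict, or None if the line does not match."""
    match = ENA_TX_STALL.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Keep the PCI address and interface name as strings, convert the rest.
    return {
        key: value if key in ("pci", "iface") else int(value)
        for key, value in fields.items()
    }


line = (
    "Jan 23 10:50:01 localhost kernel: ena 0000:00:05.0 eth0: "
    "TX hasn't completed, qid 2, index 13. 127850 msecs since last interrupt, "
    "127850 msecs since last napi execution, napi scheduled: 0"
)
stall = parse_ena_tx_stall(line)
print(stall["iface"], stall["qid"], stall["ms_since_interrupt"] / 1000)
```

In the quoted line, 127850 msecs is about 128 seconds without a TX interrupt on queue 2.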

We'll update the issue when we know more. In the meantime, can you confirm how many instances you have seen this happen on and when it first happened? Did you ever see this type of failure on previous Bottlerocket releases?

yeazelm commented on July 26, 2024

Thanks for the quick update @Pluies ! That frequency is super helpful in trying to find what might be the cause.

Pluies commented on July 26, 2024

I'm not sure whether this is just a result of the above or a contributing factor, but we're seeing the following error message in the AWS dashboard:

Instance status checks
Instance reachability check failed
Check failure at
2024/01/23 11:50 GMT+1 (1 day)

I'm now going to kill the node above (it's not cheap 😅), which should be fine given we have the volumes + logs.

The issue hasn't reappeared for the past 24h.

shaharitzko commented on July 26, 2024

@Pluies can you please tell me the region you are using?

yeazelm commented on July 26, 2024

@Pluies Could you cut a ticket to AWS support and include the instance ids of the failures you know about and refer to this issue? I think that might help us dig in more.

Pluies commented on July 26, 2024

Update from AWS:

> Just as a precaution I have opened an internal investigation into the Bottleneck AMI related network error with the EKS service team and they are looking into a possible kernel bug. Please let me know if you see any reocurrences of the same problem.

FWIW, we've not seen this happen again since Jan 26th. I'll update this thread with the results of the ticket above. 👍

Pluies commented on July 26, 2024

The issue happens much less often, but we have seen it again on:

  • 2024-02-09T00:38:09Z
  • 2024-02-18T23:04:09Z
  • 2024-02-25T09:39:13Z
  • 2024-02-25T13:53:08Z
