Comments (6)

beyhan commented on July 19, 2024

@max-soe is there a BOSH Director version since which you have observed this behaviour?

from bosh.

max-soe commented on July 19, 2024

No. Since it is only a false-positive alert, nobody has looked into it. But over the last month the number of alerts has increased. My assumption is that we are now using faster and faster VM types, which makes this happen more often.

jpalermo commented on July 19, 2024

I think the health monitor logs all the VM NATS updates. So you should be able to reconstruct a history of what happened and how it ended up in this state. Are the updates not being received/processed by the health monitor? Is it a timing issue?

max-soe commented on July 19, 2024

Yes, I have the full log, but it's too big for this thread. I can summarize it, though.

  1. Start of health monitoring after the bosh director is recreated:
I, [2024-06-04T00:28:59.173218 #7]  INFO : Connected to NATS at 'nats://bosh.cf-eu10-003.internal:4222'
I, [2024-06-04T00:28:59.173585 #7]  INFO : Logging delivery agent is running...
I, [2024-06-04T00:28:59.173828 #7]  INFO : Event logger is running...
I, [2024-06-04T00:28:59.174100 #7]  INFO : Resurrector is running...
I, [2024-06-04T00:28:59.175149 #7]  INFO : HTTP server is starting on port 25923...
I, [2024-06-04T00:28:59.201964 #7]  INFO : BOSH HealthMonitor 0.0.0 is running...
I, [2024-06-04T00:28:59.202087 #7]  INFO : connection.graphite-connected
I, [2024-06-04T00:28:59.761777 #7]  INFO : Resurrection config remains the same
  2. Found deployments and adding agents:
I, [2024-06-04T00:28:59.802495 #7]  INFO : Found deployment 'app-autoscaler'
I, [2024-06-04T00:28:59.881586 #7]  INFO : Adding agent e35a914e-233a-44c2-a01a-bb52bdd7faa3 (metricsforwarder/0d04ccc0-7a15-4e9d-9ec1-0f9c6eb7897d) to app-autoscaler...
  3. Start analysing and reporting some alerts:
I, [2024-06-04T00:30:29.179033 #7]  INFO : Analyzing agents...
I, [2024-06-04T00:30:29.179254 #7]  INFO : [ALERT] Alert @ 2024-06-04 00:30:29 UTC, severity 2: e35a914e-233a-44c2-a01a-bb52bdd7faa3 has timed out
I, [2024-06-04T00:30:29.251761 #7]  INFO : (Event logger) notifying director about event: Alert @ 2024-06-04 00:30:29 UTC, severity 2: e35a914e-233a-44c2-a01a-bb52bdd7faa3 has timed out
  4. Meltdown alerts:
I, [2024-06-04T00:30:29.256503 #7]  INFO : [ALERT] Alert @ 2024-06-04 00:30:29 UTC, severity 2: app-autoscaler has instances with timed out agents
I, [2024-06-04T00:30:29.256565 #7]  INFO : (Event logger) notifying director about event: Alert @ 2024-06-04 00:30:29 UTC, severity 2: app-autoscaler has instances with timed out agents
I, [2024-06-04T00:30:29.308410 #7]  INFO : [ALERT] Alert @ 2024-06-04 00:30:29 UTC, severity 1: Skipping resurrection for instances: metricsforwarder/0d04ccc0-7a15-4e9d-9ec1-0f9c6eb7897d, metricsforwarder/fb8e7a9b-d7d0-47cd-bfd4-f1e59b880f35, metricsserver/fe5527ab-57fa-4c24-a457-acebe8e65edb, metricsgateway/bd0b06a0-180e-4a1b-8d3e-0588c5eeefe7, operator/d199c4db-5d05-456b-91c0-ed634d40bd07, operator/f40b3de1-6e01-4e8a-8a42-ab99336536f1, eventgenerator/7bb09c5b-a81f-47ac-bc3c-57ce20e25c54, scalingengine/dda1e1f7-3290-43eb-b605-1e040614966f; deployment: 'app-autoscaler'; 8 of 23 agents are unhealthy (34.8%)
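To make the timing in this excerpt easier to see, here is a minimal Python sketch that measures how long after "Adding agent" each agent was reported as timed out. It only relies on the line shapes visible in the excerpt above; the function name `seconds_until_timeout` and the regular expressions are illustrative assumptions, not part of BOSH, and other health monitor versions may log differently.

```python
import re
from datetime import datetime

# Line prefix as seen in the excerpt: "I, [<timestamp> #<pid>]  INFO : <message>"
LINE_RE = re.compile(
    r"^\w, \[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) #\d+\]\s+\w+ : (?P<msg>.*)$"
)
# "Adding agent <agent id> (<job>/<instance id>) to <deployment>..."
ADDED_RE = re.compile(r"^Adding agent (?P<agent>[0-9a-f-]{36}) ")
# "[ALERT] Alert @ ..., severity N: <agent id> has timed out"
TIMEOUT_RE = re.compile(r"^\[ALERT\] .*: (?P<agent>[0-9a-f-]{36}) has timed out$")


def seconds_until_timeout(lines):
    """Map agent id -> seconds between 'Adding agent' and its timeout alert."""
    added = {}
    gaps = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a log line in the expected shape
        ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S.%f")
        msg = match["msg"]
        added_match = ADDED_RE.match(msg)
        timeout_match = TIMEOUT_RE.match(msg)
        if added_match:
            added[added_match["agent"]] = ts
        elif timeout_match and timeout_match["agent"] in added:
            agent = timeout_match["agent"]
            gaps[agent] = (ts - added[agent]).total_seconds()
    return gaps


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python hm_gaps.py health_monitor.log
    with open(sys.argv[1], encoding="utf-8") as log_file:
        for agent, gap in seconds_until_timeout(log_file).items():
            print(f"{agent}: timed out {gap:.1f}s after being added")
```

For agent e35a914e-233a-44c2-a01a-bb52bdd7faa3 in the excerpt, the gap between 00:28:59.881586 and 00:30:29.179254 is about 89.3 seconds.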

When the issue does not occur, the logs look similar except for the meltdown alerts. We get timeout alerts on all landscapes, but they do not hit the threshold every time. Since the other deployments, such as autoscaler here, are not being updated, I think this is a timing issue. But how can we fix it to avoid these false positives?

jpalermo commented on July 19, 2024

Working backward: to get the meltdown alert, the agents need to be timed out.

Agent objects, when they are created, are seeded with the current time as their updated_at value, which is then set to the current time whenever the agent heartbeats.

So why didn't the health monitor get a heartbeat from the agent before it was timed out? You can check the health monitor logs to see when it got heartbeats from the agent. Agents send heartbeats every 30 seconds, so you should be able to work forward or backward from other heartbeats in the logs to see what the health monitor was doing when it should have received it.
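The mechanism described above can be sketched roughly as follows. This is a simplified Python model, not the health monitor's actual code: the `Agent` class, the `timed_out` comparison, and the 60-second `AGENT_TIMEOUT` value are assumptions chosen for illustration. Only the seeding of updated_at at creation and the 30-second heartbeat interval come from the description above.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL = 30.0  # seconds, as stated in the comment above
AGENT_TIMEOUT = 60.0       # ASSUMPTION: illustrative value, check your own configuration


@dataclass
class Agent:
    """Toy model of an agent record kept by the health monitor."""
    created_at: float
    updated_at: float = field(init=False)

    def __post_init__(self):
        # Seeded with the creation time, so a new agent is not timed out right away.
        self.updated_at = self.created_at

    def heartbeat(self, now: float) -> None:
        # Every processed heartbeat refreshes updated_at.
        self.updated_at = now

    def timed_out(self, now: float, timeout: float = AGENT_TIMEOUT) -> bool:
        # ASSUMPTION: a strict "older than timeout" comparison.
        return now - self.updated_at > timeout


# Agent added at t=0, analysis runs about 89.3 s later (the gap seen in the log excerpt).
silent = Agent(created_at=0.0)
print(silent.timed_out(now=89.3))   # no heartbeat was ever processed

healthy = Agent(created_at=0.0)
for t in (HEARTBEAT_INTERVAL, 2 * HEARTBEAT_INTERVAL):
    healthy.heartbeat(now=t)
print(healthy.timed_out(now=89.3))  # last heartbeat was 29.3 s before the analysis
```

Under these assumptions, an agent added about 89 seconds before the analysis is only flagged if no heartbeat was processed during the final 60 seconds, which is why the heartbeat entries just before the "Analyzing agents..." line are the ones worth inspecting.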

You might also check the agent logs in this situation to see if there was any sort of error when sending the heartbeat.

max-soe commented on July 19, 2024

We can no longer reproduce this issue since we migrated from BOSH version 280.0.20 to 280.0.25. Maybe it was fixed by the metric server improvements. I will close this ticket for now.
