Comments (6)
@max-soe Is there a BOSH Director version since which you have observed this behaviour?
from bosh.
No. As it is only a false-positive alert, nobody has looked into it. But in the last month the number of alerts has increased. My assumption is that we are now using faster and faster VM types, and so this happens more often.
from bosh.
I think the health monitor logs all the VM NATS updates. So you should be able to reconstruct a history of what happened and how it ended up in this state. Are the updates not being received/processed by the health monitor? Is it a timing issue?
from bosh.
Yes, I have the full log, but it's too big for this thread. I can summarize it, though.
- Start of health-monitoring after the bosh director is recreated:
I, [2024-06-04T00:28:59.173218 #7] INFO : Connected to NATS at 'nats://bosh.cf-eu10-003.internal:4222'
I, [2024-06-04T00:28:59.173585 #7] INFO : Logging delivery agent is running...
I, [2024-06-04T00:28:59.173828 #7] INFO : Event logger is running...
I, [2024-06-04T00:28:59.174100 #7] INFO : Resurrector is running...
I, [2024-06-04T00:28:59.175149 #7] INFO : HTTP server is starting on port 25923...
I, [2024-06-04T00:28:59.201964 #7] INFO : BOSH HealthMonitor 0.0.0 is running...
I, [2024-06-04T00:28:59.202087 #7] INFO : connection.graphite-connected
I, [2024-06-04T00:28:59.761777 #7] INFO : Resurrection config remains the same
- Found deployments and adding agents:
I, [2024-06-04T00:28:59.802495 #7] INFO : Found deployment 'app-autoscaler'
I, [2024-06-04T00:28:59.881586 #7] INFO : Adding agent e35a914e-233a-44c2-a01a-bb52bdd7faa3 (metricsforwarder/0d04ccc0-7a15-4e9d-9ec1-0f9c6eb7897d) to app-autoscaler...
- Start analysing and reporting some alerts:
I, [2024-06-04T00:30:29.179033 #7] INFO : Analyzing agents...
I, [2024-06-04T00:30:29.179254 #7] INFO : [ALERT] Alert @ 2024-06-04 00:30:29 UTC, severity 2: e35a914e-233a-44c2-a01a-bb52bdd7faa3 has timed out
I, [2024-06-04T00:30:29.251761 #7] INFO : (Event logger) notifying director about event: Alert @ 2024-06-04 00:30:29 UTC, severity 2: e35a914e-233a-44c2-a01a-bb52bdd7faa3 has timed out
- Meltdown alerts:
I, [2024-06-04T00:30:29.256503 #7] INFO : [ALERT] Alert @ 2024-06-04 00:30:29 UTC, severity 2: app-autoscaler has instances with timed out agents
I, [2024-06-04T00:30:29.256565 #7] INFO : (Event logger) notifying director about event: Alert @ 2024-06-04 00:30:29 UTC, severity 2: app-autoscaler has instances with timed out agents
I, [2024-06-04T00:30:29.308410 #7] INFO : [ALERT] Alert @ 2024-06-04 00:30:29 UTC, severity 1: Skipping resurrection for instances: metricsforwarder/0d04ccc0-7a15-4e9d-9ec1-0f9c6eb7897d, metricsforwarder/fb8e7a9b-d7d0-47cd-bfd4-f1e59b880f35, metricsserver/fe5527ab-57fa-4c24-a457-acebe8e65edb, metricsgateway/bd0b06a0-180e-4a1b-8d3e-0588c5eeefe7, operator/d199c4db-5d05-456b-91c0-ed634d40bd07, operator/f40b3de1-6e01-4e8a-8a42-ab99336536f1, eventgenerator/7bb09c5b-a81f-47ac-bc3c-57ce20e25c54, scalingengine/dda1e1f7-3290-43eb-b605-1e040614966f; deployment: 'app-autoscaler'; 8 of 23 agents are unhealthy (34.8%)
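The percentage in the last alert can be reproduced by hand. The following is a minimal sketch of that arithmetic, not the health monitor's own code. The function names and the 20% threshold are assumptions for illustration only; the threshold actually in effect is whatever the director's resurrector configuration sets.

```python
def unhealthy_percent(unhealthy: int, total: int) -> float:
    """Share of unhealthy agents in a deployment, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * unhealthy / total


def exceeds_threshold(unhealthy: int, total: int, threshold_percent: float) -> bool:
    """True when the unhealthy share reaches the given threshold."""
    return unhealthy_percent(unhealthy, total) >= threshold_percent


# Hypothetical threshold, chosen only to illustrate the comparison.
ILLUSTRATIVE_THRESHOLD_PERCENT = 20.0

# Numbers taken from the alert above: 8 of 23 agents are unhealthy.
percent = unhealthy_percent(8, 23)
print(f"{percent:.1f}%")  # 34.8%
print(exceeds_threshold(8, 23, ILLUSTRATIVE_THRESHOLD_PERCENT))
```

With a threshold like this, a few agents timing out in a small deployment is enough to cross the line, which fits the observation that the alert does not fire every time.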
When the issue does not occur, the reports are similar except for the meltdown alerts. We get timeout alerts on all landscapes, but they do not hit the threshold every time. As the other deployments, like autoscaler here, are not being updated, I think this is a timing issue. But how can we fix it to avoid these false positives?
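One way to test the timing hypothesis is to measure how long the agent was known to the health monitor before it was reported as timed out. The sketch below parses the timestamps of two log lines copied from the excerpt above and compares the gap with the 30-second heartbeat interval mentioned in this thread. The helper names are hypothetical, and the regular expression only assumes the timestamp layout visible in these log lines.

```python
import re
from datetime import datetime

# Two lines copied from the log excerpt in this thread.
LOG_LINES = [
    "I, [2024-06-04T00:28:59.881586 #7] INFO : Adding agent "
    "e35a914e-233a-44c2-a01a-bb52bdd7faa3 "
    "(metricsforwarder/0d04ccc0-7a15-4e9d-9ec1-0f9c6eb7897d) to app-autoscaler...",
    "I, [2024-06-04T00:30:29.179254 #7] INFO : [ALERT] Alert @ 2024-06-04 00:30:29 UTC, "
    "severity 2: e35a914e-233a-44c2-a01a-bb52bdd7faa3 has timed out",
]

# Matches the "[<ISO timestamp> #<pid>]" prefix seen in the log lines.
TIMESTAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) #\d+\]")

HEARTBEAT_INTERVAL_S = 30  # interval stated in this thread


def log_time(line: str) -> datetime:
    """Extract the timestamp from one health monitor log line."""
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError(f"no timestamp found in: {line!r}")
    return datetime.fromisoformat(match.group(1))


def seconds_between(first: str, second: str) -> float:
    """Elapsed seconds between two log lines."""
    return (log_time(second) - log_time(first)).total_seconds()


gap_s = seconds_between(LOG_LINES[0], LOG_LINES[1])
# Any window of this length contains at least this many 30-second ticks.
min_expected_heartbeats = int(gap_s // HEARTBEAT_INTERVAL_S)

print(f"gap: {gap_s:.1f}s, expected at least {min_expected_heartbeats} heartbeats")
```

For these two lines the gap is about 89.3 seconds, so an agent that was up and heartbeating for the whole window should have been heard from at least twice before the first analysis run.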
from bosh.
Working backward: to get the meltdown alert, the agents need to have timed out.
Agent objects, when they are created, are seeded with the current time as their updated_at value, which is then reset to the current time whenever the agent heartbeats.
So why didn't the health monitor get a heartbeat from the agent before it was timed out? You can check the health monitor logs to see when it got heartbeats from the agent. Agents send heartbeats every 30 seconds, so you should be able to work forward or backward from other heartbeats in the logs to see what the health monitor was doing when it should have received it.
You might also check the agent logs in this situation to see whether there was any error when sending the heartbeat.
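The mechanism described above can be modelled in a few lines. This is a simplified Python sketch, not the health monitor's implementation: the class, the method names, and the 60-second timeout are assumptions for illustration, and only the updated_at field and the 30-second heartbeat interval come from this thread.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Toy model of an agent as tracked by a monitor."""
    agent_id: str
    updated_at: float  # seeded with the creation time, in seconds

    def heartbeat(self, now: float) -> None:
        # Every heartbeat resets updated_at to the current time.
        self.updated_at = now

    def timed_out(self, now: float, timeout_s: float) -> bool:
        # Timed out when nothing was heard for longer than timeout_s.
        return (now - self.updated_at) > timeout_s


TIMEOUT_S = 60.0       # hypothetical timeout, for illustration only
ANALYSIS_TIME_S = 89.3  # roughly the gap seen in the log excerpt

# Both agents are created at t=0; only one of them heartbeats.
silent = Agent("silent", updated_at=0.0)
chatty = Agent("chatty", updated_at=0.0)
for t in (30.0, 60.0):  # heartbeats every 30 seconds
    chatty.heartbeat(now=t)

print(silent.timed_out(ANALYSIS_TIME_S, TIMEOUT_S))  # True
print(chatty.timed_out(ANALYSIS_TIME_S, TIMEOUT_S))  # False
```

In this model an agent is only flagged at the first analysis run if none of its heartbeats were processed since it was added, which is why the heartbeat entries in the health monitor and agent logs are the place to look.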
from bosh.
We can no longer reproduce this issue after migrating from BOSH version 280.0.20 to 280.0.25. Maybe it was fixed by the metric server improvements. I will close this ticket for now.
from bosh.