
hm9000's Introduction

Please read before you submit any issue or PR

HM9000 will only be updated in response to vulnerability discoveries and major bugs. No new features will be introduced during this period.

See EOL Timeline for Legacy DEA Backend

Health Manager 9000

HM 9000 is a rewrite of CloudFoundry's Health Manager. HM 9000 is written in Golang and has a more modular architecture compared to the original ruby implementation. HM 9000's dependencies are locked down in a separate repo, the hm-workspace.

There are several Go Packages in this repository, each with a comprehensive set of unit tests. In addition there is an integration test that exercises the interactions between the various components. What follows is a detailed breakdown.

HM9000's Architecture and High-Availability

HM9000 solves the high-availability problem by relying on etcd, a robust high-availability store distributed across multiple nodes. Individual HM9000 components are built to rely completely on the store for their knowledge of the world. This removes the need for maintaining in-memory information and clarifies the relationship between the various components (all data must flow through the store).

To avoid the singleton problem, we will turn on multiple instances of each HM9000 component across multiple nodes. These instances will vie for a lock in the high-availability store. The instance that grabs the lock gets to run and is responsible for maintaining the lock. Should that instance enter a bad state or die, the lock becomes available allowing another instance to pick up the slack. Since all state is stored in the store, the backup component should be able to function independently of the failed component.
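
To make the locking scheme concrete, here is a rough sketch of the compare-and-set primitive it relies on, expressed against etcd's v2 keys API (the components actually go through the store adapter; the key name, instance value, port, and TTL below are illustrative assumptions, not HM9000's real values):

# attempt to acquire the lock: this only succeeds if the key does not already exist
$ curl -s -X PUT "http://127.0.0.1:4001/v2/keys/hm9000-locks/analyzer?prevExist=false" \
       -d value=my-instance-guid -d ttl=30

# while healthy, keep renewing the TTL with a compare-and-swap on our own value
$ curl -s -X PUT "http://127.0.0.1:4001/v2/keys/hm9000-locks/analyzer?prevValue=my-instance-guid" \
       -d value=my-instance-guid -d ttl=30

If the lock holder dies and stops renewing, the TTL expires, the key disappears, and another instance's acquire call succeeds.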

For more information, see the HM9000 release announcement.

Deployment

Recovering from Failure

If HM9000 enters a bad state, the simplest solution - typically - is to delete the contents of the data store. Follow the steps defined by the etcd-release for Disaster Recovery; HM9000 should recover on its own.

Installing HM9000 locally

Assuming you have go v1.5+ installed:

  1. Clone dea-hm-workspace and its submodules:

     $ cd $HOME (or other appropriate base directory)
     $ git clone https://github.com/cloudfoundry/dea-hm-workspace
     $ cd dea-hm-workspace
     $ git submodule update --init --recursive
     $ mkdir bin
     $ export GOPATH=$PWD
     $ export PATH=$PATH:$GOPATH/bin
    
  2. Download and install gnatsd (the version downloaded here is for linux-x64 - if you have a different platform, be sure to download the correct tarball):

     $ wget https://github.com/nats-io/gnatsd/releases/download/v0.7.2/gnatsd-v0.7.2-linux-amd64.tar.gz
     $ tar xzf gnatsd-v0.7.2-linux-amd64.tar.gz
     $ mv ./gnatsd $GOPATH/bin
    
  3. Install etcd to $GOPATH/bin (the version downloaded here is for linux-x64 - if you have a different platform, be sure to download the correct tarball):

     $ wget https://github.com/coreos/etcd/releases/download/v2.2.4/etcd-v2.2.4-linux-amd64.tar.gz
     $ tar xzf etcd-v2.2.4-linux-amd64.tar.gz
     $ mv etcd-v2.2.4-linux-amd64/etcd $GOPATH/bin
    
  4. Start etcd:

     $ mkdir $HOME/etcdstorage
     $ (cd $HOME/etcdstorage && etcd &)
    

    etcd generates a number of files in the current working directory when run locally, hence the separate etcdstorage directory.

  5. Run hm9000:

     $ go install github.com/cloudfoundry/hm9000
     $ hm9000 <args>
    

    Run hm9000 --help to see usage information and the supported commands.

  6. Install consul (if you plan to run the integration test suite):

    The mcat integration test suite requires that the consul binary be in your PATH. Refer to the installation instructions for your specific platform to download and install consul.

  7. Running the tests

     $ go get github.com/onsi/ginkgo/ginkgo
     $ cd src/github.com/cloudfoundry/hm9000/
     $ ginkgo -r -p -skipMeasurements -race -failOnPending -randomizeAllSpecs
    

    These tests will spin up their own instances of etcd as needed. They shouldn't interfere with your long-running etcd server.

  8. Updating hm9000. You'll need to fetch the latest code and recompile the hm9000 binary:

     $ cd $GOPATH/src/github.com/cloudfoundry/hm9000
     $ git checkout master
     $ git pull
     $ go install .
    

Running HM9000

hm9000 requires a config file. To get started:

$ cd $GOPATH/src/github.com/cloudfoundry/hm9000
$ cp ./config/default_config.json ./local_config.json
$ vim ./local_config.json

You must specify a config file for every hm9000 command; you do this with, e.g., --config=./local_config.json
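
The config file is plain JSON. As a rough illustration only (the values below are placeholders and the excerpt is deliberately incomplete - see config/default_config.json and the HM9000 Config section below for the authoritative set of entries):

{
  "heartbeat_period_in_seconds": 10,
  "log_level": "INFO",
  "store_schema_version": 1,
  "store_urls": ["http://127.0.0.1:4001"],
  "cc_base_url": "http://127.0.0.1:6001",
  "desired_state_batch_size": 500,
  "fetcher_network_timeout_in_seconds": 10
}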

Analyzing desired state

hm9000 analyze --config=./local_config.json

will connect to the CC, fetch the desired state, put it in the store, compute the delta between desired and actual state, and then evaluate the pending starts and stops and publish them over NATS. You can optionally pass --poll to do this periodically.
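
For example, to run the analyzer continuously on its configured polling interval (analyzer_polling_interval_in_heartbeats, described below):

hm9000 analyze --config=./local_config.json --poll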

Listening for actual state

hm9000 listen --config=./local_config.json

will come up, listen for heartbeat messages via NATS and HTTP, and put them in the store.

Serving API

hm9000 serve_api --config=./local_config.json

will come up and respond to requests for /bulk_app_state over HTTP.
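
As a sketch of what a request might look like (the credentials, address, and port come from the api_server_* config entries described below; the POST body of droplet/version pairs is an assumption here, so consult the apiserver package for the exact request contract):

$ curl -s -u API_SERVER_USERNAME:API_SERVER_PASSWORD \
       -X POST "http://API_SERVER_ADDRESS:API_SERVER_PORT/bulk_app_state" \
       -d '[{"droplet":"APP_GUID","version":"APP_VERSION"}]'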

Evacuator

hm9000 evacuator --config=./local_config.json

will come up and listen for droplet.exited messages and queue start messages for any evacuating droplets. Start messages will be sent when the analyzer sends start and stop messages. The evacuator is not necessary for deterministic evacuation but is provided for backward compatibility with old DEAs. There is no harm in running the evacuator during deterministic evacuation.

Shredder

hm9000 shred --config=./local_config.json

The shredder compacts the store - removing any orphaned (empty) directories. You can optionally pass --poll to shred periodically (once per hour, by default).

Dumping the contents of the store

hm9000 dump --config=./local_config.json

will dump the entire contents of the store to stdout. The output is structured in terms of apps and provides insight into the state of a cloud foundry installation. If you want a raw dump of the store's contents pass the --raw flag.

etcd has a very simple curlable API, which you can use in lieu of dump.
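
For example, assuming etcd is listening on one of its default local client ports (2379, or the legacy 4001) as in the local setup above:

$ curl -s "http://127.0.0.1:4001/v2/keys/?recursive=true"

The exact key prefix HM9000 writes under depends on store_schema_version; the recursive listing above simply returns everything.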

How to dump the contents of the store on a bosh deployed health manager

watch -n 1 /var/vcap/packages/hm9000/hm9000 dump --config=/var/vcap/jobs/hm9000/config/hm9000.json

on a health manager instance should dump the store (refreshing once per second).

HM9000 Config

HM9000 is configured using a JSON file. Here are the available entries:

  • heartbeat_period_in_seconds: Almost all configurable time constants in HM9000's config are specified in terms of this one fundamental unit of time - the time interval between heartbeats in seconds. This should match the value specified in the DEAs and is typically set to 10 seconds.

  • heartbeat_ttl_in_heartbeats: Incoming heartbeats are stored in the store with a TTL. When this TTL expires the instance associated with the heartbeat is considered to have "gone missing". This TTL is set to 3 heartbeat periods.

  • actual_freshness_ttl_in_heartbeats: This constant serves two purposes. It is the TTL of the actual-state freshness key in the store. The store's representation of the actual state is only considered fresh if the actual-state freshness key is present. Moreover, the actual-state is fresh only if the actual-state freshness key has been present for at least actual_freshness_ttl_in_heartbeats. This avoids the problem of having the first detected heartbeat render the entire actual-state fresh -- we must wait a reasonable period of time to hear from all DEAs before calling the actual-state fresh. This TTL is set to 3 heartbeat periods

  • grace_period_in_heartbeats: A generic grace period used when scheduling messages. For example, we delay start messages by this grace period to give a missing instance a chance to start up before sending a start message. The grace period is set to 3 heartbeat periods.

  • desired_freshness_ttl_in_heartbeats: The TTL of the desired-state freshness. Set to 12 heartbeats. The desired-state is considered stale if it has not been updated in 12 heartbeats.

  • store_max_concurrent_requests: The maximum number of concurrent requests that each component may make to the store. Set to 30.

  • sender_message_limit: The maximum number of messages the sender should send per invocation. Set to 30.

  • sender_polling_interval_in_heartbeats: The time period in heartbeat units between sender invocations when using hm9000 send --poll. Set to 1.

  • sender_timeout_in_heartbeats: The timeout in heartbeat units for each sender invocation. If an invocation of the sender takes longer than this the hm9000 send --poll command will fail. Set to 10.

  • fetcher_polling_interval_in_heartbeats: The time period in heartbeat units between desired state fetcher invocations when using hm9000 fetch_desired --poll. Set to 6.

  • fetcher_timeout_in_heartbeats: The timeout in heartbeat units for each desired state fetcher invocation. If an invocation of the fetcher takes longer than this the hm9000 fetch_desired --poll command will fail. Set to 60.

  • analyzer_polling_interval_in_heartbeats: The time period in heartbeat units between analyzer invocations when using hm9000 analyze --poll. Set to 1.

  • analyzer_timeout_in_heartbeats: The timeout in heartbeat units for each analyzer invocation. If an invocation of the analyzer takes longer than this the hm9000 analyze --poll command will fail. Set to 10.

  • shredder_polling_interval_in_heartbeats: The time period in heartbeat units between shredder invocations when using hm9000 shred --poll. Set to 360.

  • shredder_timeout_in_heartbeats: The timeout in heartbeat units for each shredder invocation. If an invocation of the shredder takes longer than this the hm9000 shred --poll command will fail. Set to 6.

  • number_of_crashes_before_backoff_begins: When an instance crashes, HM9000 immediately restarts it. If, however, the number of crashes exceeds this number, HM9000 will apply an increasing delay to the restart.

  • starting_backoff_delay_in_heartbeats: The initial delay (in heartbeat units) to apply to the restart message once an instance crashes more than number_of_crashes_before_backoff_begins times.

  • maximum_backoff_delay_in_heartbeats: The restart delay associated with crashes doubles with each crash but is not allowed to exceed this value (in heartbeat units). See the worked example after this list.

  • listener_heartbeat_sync_interval_in_milliseconds: The listener aggregates heartbeats and flushes them to the store periodically with this interval.

  • store_heartbeat_cache_refresh_interval_in_milliseconds: To improve performance when writing heartbeats, the store maintains a write-through cache of the store contents. This cache is invalidated and refetched periodically with this interval.

  • cc_auth_user: The user to use when authenticating with the CC desired state API. Set by BOSH.

  • cc_auth_password: The password to use when authenticating with the CC desired state API. Set by BOSH.

  • cc_base_url: The base url for the CC API. Set by BOSH.

  • desired_state_batch_size: The batch size when fetching desired state information from the CC. Set to 500.

  • fetcher_network_timeout_in_seconds: Each API call to the CC must succeed within this timeout. Set to 10 seconds.

  • store_schema_version: The schema of the store. HM9000 does not migrate the store, instead, if the store data format/layout changes and is no longer backward compatible the schema version must be bumped.

  • store_urls: An array of etcd server URLs to connect to.

  • actual_freshness_key: The key for the actual freshness in the store. Set to "/actual-fresh".

  • desired_freshness_key: The key for the desired freshness in the store. Set to "/desired-fresh".

  • dropsonde_port: The port which metron is listening on to receive metrics.

  • api_server_address: The IP address of the machine running HM9000.

  • api_server_port: The port on which to serve the HTTP API.

  • api_server_username: User name to be used for basic auth on the API server.

  • api_server_password: Password to be used for basic auth on the API server.

  • log_level: Must be one of "INFO" or "DEBUG".

  • sender_nats_start_subject: The NATS subject for HM9000's start messages. Set to "hm9000.start".

  • sender_nats_stop_subject: The NATS subject for HM9000's stop messages. Set to "hm9000.stop".

  • nats.host: The NATS host. Set by BOSH.

  • nats.port: The NATS port. Set by BOSH.

  • nats.user: The user for NATS authentication. Set by BOSH.

  • nats.password: The password for NATS authentication. Set by BOSH.
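
To make the crash/backoff entries above concrete, here is an illustrative calculation. The values are hypothetical, not shipped defaults: assume heartbeat_period_in_seconds = 10, number_of_crashes_before_backoff_begins = 3, starting_backoff_delay_in_heartbeats = 3, and maximum_backoff_delay_in_heartbeats = 96.

crashes 1-3: restart immediately
crash 4:     delay of  3 heartbeats =  30s
crash 5:     delay of  6 heartbeats =  60s
crash 6:     delay of 12 heartbeats = 120s
crash 7:     delay of 24 heartbeats = 240s
crash 8:     delay of 48 heartbeats = 480s
crash 9+:    delay capped at 96 heartbeats = 960s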

HM9000 components

hm9000 (the top level) and hm

The top level is home to the hm9000 CLI. The hm package houses the CLI logic to keep the root directory cleaner. The hm package is where the other components are instantiated, fed their dependencies, and executed.

actualstatelistener

The actualstatelistener provides a simple listener daemon that monitors the NATS stream for app heartbeats. It generates an entry in the store for each heartbeating app under /actual/INSTANCE_GUID.

It also maintains a FreshnessTimestamp under /actual-fresh to allow other components to know whether or not they can trust the information under /actual.

desiredstatefetcher

The desiredstatefetcher requests the desired state from the cloud controller. It transparently manages fetching the authentication information over NATS and making batched http requests to the bulk api endpoint.

Desired state is stored under /desired/APP_GUID-APP_VERSION.

analyzer

The analyzer comes up, analyzes the actual and desired state, and puts pending start and stop messages in the store. If a start or stop message is already in the store, the analyzer will not override it.

These are the metrics emitted:

  • NumberOfAppsWithAllInstancesReporting: The number of desired applications for which all instances are reporting (the state of the instance is irrelevant: STARTING/RUNNING/CRASHED all count).
  • NumberOfAppsWithMissingInstances: The number of desired applications for which an instance is missing (i.e. the instance is simply not heartbeating at all).
  • NumberOfUndesiredRunningApps: The number of undesired applications with at least one instance reporting as STARTING or RUNNING.
  • NumberOfRunningInstances: The number of instances in the STARTING or RUNNING state.
  • NumberOfMissingIndices: The number of missing instances (these are instances that are desired but are simply not heartbeating at all).
  • NumberOfCrashedInstances: The number of instances reporting as crashed.
  • NumberOfCrashedIndices: The number of indices reporting as crashed. Because of the restart policy an individual index may have very many crashes associated with it.

If either the actual state or desired state are not fresh all of these metrics will have the value -1.

sender

The sender runs periodically, pulls pending messages out of the store, and sends them over NATS. The sender verifies that the messages should still be sent before sending them (i.e. missing instances are still missing, extra instances are still extra, etc.). The sender is also responsible for throttling the rate at which messages are sent over NATS.

apiserver

The apiserver responds to NATS app.state messages and allows other CloudFoundry components to obtain information about arbitrary applications.

evacuator

The evacuator responds to NATS droplet.exited messages. If an app exits because it is EVACUATING, the evacuator sends a start message over NATS. The evacuator is not necessary during deterministic evacuations but is provided to maintain backward compatibility with older DEAs.

shredder

The shredder prunes old/crufty/unnecessary data from the store. This includes pruning old schema versions of the store.

Support Packages

config

config parses the config.json configuration. Components are typically given an instance of config by the hm CLI.

helpers

helpers contains a number of support utilities.

httpclient

A trivial wrapper around net/http that improves testability of http requests.

logger

Provides a (sys)logger. Eventually this will use steno to perform logging.

metricsaccountant

Supports metrics tracking. Used by the metricsserver and components that post metrics.

models

models encapsulates the various JSON structs that are sent/received over NATS/HTTP. Simple serializing/deserializing behavior is attached to these structs.

store

store sits on top of the lower-level storeadapter and provides the various hm9000 components with high-level access to the store (components speak to the store about setting and fetching models instead of the lower-level StoreNode defined in the storeadapter).

Test Support Packages (under testhelpers)

testhelpers contains a (large) number of test support packages. These range from simple fakes to comprehensive libraries used for faking out other CloudFoundry components (e.g. heartbeating DEAs) in integration tests.

Fakes

fakelogger

Provides a fake implementation of the helpers/logger interface

fakehttpclient

Provides a fake implementation of the helpers/httpclient interface that allows tests to have fine-grained control over the http request/response lifecycle.

fakemetricsaccountant

Provides a fake implementation of the helpers/metricsaccountant interface that allows tests to make assertions on metrics tracking.

Fixtures & Misc.

app

app is a simple domain object that encapsulates a running CloudFoundry app.

The app package can be used to generate self-consistent data structures (heartbeats, desired state). These data structures are then passed into the other test helpers to simulate a CloudFoundry eco-system.

Think of app as your source of fixture test data. It's intended to be used in integration tests and unit tests.

Some brief documentation -- look at the code and tests for more:

//get a new fixture app, this will generate appropriate
//random APP and VERSION GUIDs
app := NewApp()

//Get the desired state for the app.  This can be passed into
//the desired state server to simulate the APP's presence in
//the CC's DB.  By default the app is staged and started, to change
//this, modify the return value.
desiredState := app.DesiredState(NUMBER_OF_DESIRED_INSTANCES)

//get an instance at index 0.  this getter will lazily create and memoize
//instances and populate them with an INSTANCE_GUID and the correct
//INDEX.
instance0 := app.InstanceAtIndex(0)

//generate a heartbeat for the app.
//note that the INSTANCE_GUID associated with the instance at index 0 will
//match that provided by app.InstanceAtIndex(0)
app.Heartbeat(NUMBER_OF_HEARTBEATING_INSTANCES)

custommatchers

Provides a collection of custom Gomega matchers.

Infrastructure Helpers

startstoplistener

Listens on the NATS bus for health.start and health.stop messages. It parses these messages and makes them available via a simple interface. Useful for testing that messages are sent by the health manager appropriately.

desiredstateserver

Brings up an in-process http server that mimics the CC's bulk endpoints (including authentication via NATS and pagination).

natsrunner

Brings up and manages the lifecycle of a live NATS server. After bringing the server up it provides a fully configured cfmessagebus object that you can pass to your test subjects.

The MCAT

The MCAT is HM9000's integration test suite. It tests HM9000 by providing it with inputs (desired state, actual state heartbeats, and time) and asserting on its outputs (start and stop messages and api/metrics endpoints).

In addition to the MCAT there is a performance-measuring test suite at https://github.com/pivotal-cf-experimental/hmperformance.


hm9000's Issues

removing store

There is a comment about

"Delete (or move) the etcd storage directory located under /var/vcap/store"

but if this is done, etcd won't start again.

2016/08/18 03:52:26 etcdserver: create snapshot directory error: mkdir /var/vcap/store/etcd: permission denied

Perhaps the docs should instead say to remove the contents of the snap and wal folders? Or perhaps the real bug is in the permissions?

bosh_etqu6msmw@3f6cd1d9-a2b9-489f-8e9f-6e573d7072ea:$ ls -l /var/vcap/
total 40
drwx------ 9 root root 4096 Aug 16 04:59 bosh
drwxr-xr-x 5 root root 4096 Aug 18 03:59 bosh_ssh
drwxr-xr-x 8 root root 4096 Aug 16 05:12 data
drwxr-xr-x 2 root root 4096 Aug 16 04:59 instance
drwxr-xr-x 2 root root 4096 Aug 16 05:12 jobs
drwxr-xr-x 2 root root 4096 Jun 28 23:00 micro
drwxr-xr-x 3 root root 4096 Jun 28 22:44 micro_bosh
drwxr-xr-x 5 root root 4096 Aug 16 05:13 monit
drwxr-xr-x 2 root root 4096 Aug 16 05:12 packages
drwxr-xr-x 4 root root 4096 Aug 18 03:56 store
lrwxrwxrwx 1 root root 18 Jun 28 22:44 sys -> /var/vcap/data/sys
bosh_etqu6msmw@3f6cd1d9-a2b9-489f-8e9f-6e573d7072ea:
$ ls -l /var/vcap/store/
total 20
drwxr-xr-x 3 vcap vcap 4096 Aug 16 05:13 etcd
drwx------ 2 root root 16384 Aug 16 05:12 lost+found
bosh_etqu6msmw@3f6cd1d9-a2b9-489f-8e9f-6e573d7072ea:$ ls -l /var/vcap/store/etcd/
total 4
drwx------ 4 vcap vcap 4096 Aug 16 05:13 member
bosh_etqu6msmw@3f6cd1d9-a2b9-489f-8e9f-6e573d7072ea:
$ ls -l /var/vcap/store/etcd/member/
ls: cannot open directory /var/vcap/store/etcd/member/: Permission denied
bosh_etqu6msmw@3f6cd1d9-a2b9-489f-8e9f-6e573d7072ea:~$ sudo ls -l /var/vcap/store/etcd/member/
total 8
drwx------ 2 vcap vcap 4096 Aug 18 03:55 snap
drwx------ 2 vcap vcap 4096 Aug 18 03:57 wal

slave hm9000_api_server stops listening on port 5155

After upgrading to cf v235, we noticed that 'cf apps' calls would periodically show ? in the instance counts. It looks like when running 2 hm9000s, only one of the hm9000_api_server instances will be listening on port 5155. If the gorouter sends requests to the one that is not listening, it logs an error and you get back all ? for your app instance states. Talking to devs in the cf slack channel, they indicated this is an issue and both hm9000_api_server instances should always be listening on port 5155.

Actual state save timeout in Listener

The Listener log shows the error below almost every 20 seconds:
{"timestamp":1463115787.966745853,"process_id":23237,"source":"vcap.hm9000.listener","log_level":"info","message":"Saving Heartbeats - {"Heartbeats to Save":"46"}","data":null}
{"timestamp":1463116020.814734697,"process_id":23237,"source":"vcap.hm9000.listener","log_level":"info","message":"Save took too long. Not bumping freshness.","data":null}

The cause is that the duration of ensureCacheIsReady (which runs on a 20-second interval) is counted as part of the duration of SyncHeartbeats, which causes the actual state save to time out.

https://github.com/cloudfoundry/hm9000/blob/master/store/actual_state.go#L42

Hm9000

I got the dmesg info shown in the attached image:

(attached image)

Reduce logging level of a number of messages

We've got hm9000 deployed in our environment and the log chatter produced by the hm9000 is significant. It is difficult to find something significant when it does happen.

I'd like to reduce the logging level of a number of the messages that are "info" today to "debug". I'd be happy to submit a PR for this work.

Looking through our logs these are the messages I'd like to lower:

Saved Heartbeats:

{"timestamp":1400019974.531062126,"process_id":20499,"source":"vcap.hm9000.listener","log_level":"info","message":"Saved Heartbeats - {\"Duration\":\"103.324349ms\",\"Heartbeats to Save\":\"1\"}","data":null}

Bumped freshness:

{"timestamp":1400019974.530391455,"process_id":20499,"source":"vcap.hm9000.listener","log_level":"info","message":"Bumped freshness","data":null}

Saving Heartbeats:

{"timestamp":1400019974.426795721,"process_id":20499,"source":"vcap.hm9000.listener","log_level":"info","message":"Saving Heartbeats - {\"Heartbeats to Save\":\"1\"}","data":null}

Received a heartbeat:

{"timestamp":1400019974.291958094,"process_id":20499,"source":"vcap.hm9000.listener","log_level":"info","message":"Received a heartbeat - {\"Heartbeats Pending Save\":\"1\"}","data":null}

Daemonize Time

{"timestamp":1400019971.193106413,"process_id":23633,"source":"vcap.hm9000.analyzer","log_level":"info","message":"Daemonize Time - {\"Component\":\"Analyzer\",\"Duration\":\"0.0266\"}","data":null}

Analyzer completed successfully:

{"timestamp":1400019971.192697287,"process_id":23633,"source":"vcap.hm9000.analyzer","log_level":"info","message":"Analyzer completed succesfully","data":null}

Skipping Already Enqueued Start Message

{"timestamp":1400019971.190926075,"process_id":23633,"source":"vcap.hm9000.analyzer","log_level":"info","message":"Skipping Already Enqueued Start Message: Identified crashed instance - {App instance data}","data":null}

Analyzing...

{"timestamp":1400019971.166519880,"process_id":23633,"source":"vcap.hm9000.analyzer","log_level":"info","message":"Analyzing...","data":null}

hm9000.desired_state_batch_size default value too high

We recently analyzed an HM9000 failure which turned out to be caused by timeouts in hm9000_analyzer.log like

{"timestamp":"1481728235.277132988","source":"fetcher","message":"fetcher.HTTP request failed with error","log_level":2,"data":{"error":"Get https://api.REDACTED/bulk/apps?batch_size=5000\u0026bulk_token={}: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"}}

This led to an HM9000 failure (which went unnoticed by monit or bosh).

We found that reducing the hm9000.desired_state_batch_size from 5000 to 500 works around the problem.

Measurements with curl for above URL show that the time the request takes scales linearly with the batch size (and total number of apps):

$ time curl -s -u REDACTED:REDACTED 'https://api.REDACTED/bulk/apps?batch_size=50&bulk_token=%7B%7D' -o /dev/null
real	0m1.369s

$ time curl -s -u REDACTED:REDACTED 'https://api.REDACTED/bulk/apps?batch_size=500&bulk_token=%7B%7D' -o /dev/null
real	0m10.928s

$ time curl -s -u REDACTED:REDACTED 'https://api.REDACTED/bulk/apps?batch_size=1000&bulk_token=%7B%7D' -o /dev/null
real	0m19.789s

You can see that with a bulk size of about 2000 you would exceed the 30 sec default timeout for our CF installation.
E.g. if you have a CF installation growing in number of apps, at some point you hit the timeout.
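
Extrapolating linearly from the measurements above (illustrative arithmetic only, roughly 20ms per app in this installation):

batch_size  500  ->  ~10s   (measured: 10.9s)
batch_size 1000  ->  ~20s   (measured: 19.8s)
batch_size 2000  ->  ~40s   (exceeds the 30s timeout)
batch_size 5000  ->  ~100s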

We found that the batch size used to be 500 but was changed, for unknown reasons, to 5000 in commit cloudfoundry-attic/cf-release@c6b7848

However, the same commit did not increase the timeout by the same factor of 10 (assuming the linear behaviour we saw above).

Also, the docs still state "desired_state_batch_size: The batch size when fetching desired state information from the CC. Set to 500."
The default config.json also still uses batch size 500 and timeout 10s.

Should the default batch size be lowered and/or the default timeout be increased?

README.md and prime time

The readme still indicates that "hm9000 is not yet a complete replacement for health_manager -- we'll update this README when it's ready for primetime.", yet, as of v164, there is no health_manager in the cloud foundry release. That certainly suggests that HM9000 is the only health monitor option for v164. Does that mean it is ready for primetime? Or should users be left without any health monitor?

hm9000_analyzer occasionally causes "monit status" to break

Occasionally, "monit status" on an hm9000 instance will return a failed status instead of detailed process status. This is resolved by restarting the hm9000_analyzer process. I assume this process is occasionally returning something that monit can't handle.

This problem cascades into bosh where "bosh vms DEPLOYMENT" will return an error related to monit. Once hm9000_analyzer is restarted on the offending instance, "bosh vms DEPLOYMENT" will return the list of vms as it should.

app status mismatch, error log find in HM

szxciscslx16689:/var/vcap/sys/log/hm9000 # cf apps

Getting apps in org huawei / space paasspace as admin...
OK

name                    requested state   instances   memory   disk   urls
sample                  started           0/1         256M     256M   sample.10.64.28.217.xip.io

szxciscslx16689:/var/vcap/sys/log/hm9000 #

szxciscslx16689:/var/vcap/sys/log/hm9000 # cf app sample

Showing health and status for app sample in org huawei / space paasspace as admin...
OK

requested state: started
instances: 0/1
usage: 256M x 1 instances
urls: sample.10.64.28.217.xip.io

     state     since                    cpu    memory           disk
#0   running   2014-11-11 10:21:50 AM   0.0%   157.3M of 256M   192.4M of 256M

szxciscslx16689:/var/vcap/sys/log/hm9000 #

The app status does not match. We use cf v173, which contains HM9000 and etcd. We checked the hm9000 logs and found many entries like:

hm9000_shredder.stdout.log:{"timestamp":1596253891.656481743,"process_id":83833,"source":"vcap.hm9000.shredder","log_level":"info","message":"Acquired lock for Shredder","data":null}
hm9000_shredder.stdout.log:{"timestamp":1596253891.280230761,"process_id":83833,"source":"vcap.hm9000.shredder","log_level":"error","message":"Lost the lock - Error:Lock the lock","data":null}
hm9000_shredder.stdout.log:{"timestamp":1415754832.953617811,"process_id":62232,"source":"vcap.hm9000.shredder","log_level":"info","message":"Acquiring lock for Shredder","data":null}
hm9000_shredder.stdout.log:{"timestamp":1415754832.955469847,"process_id":62232,"source":"vcap.hm9000.shredder","log_level":"info","message":"Acquired lock or Shredder","data":null}

hm9000_analyzer.stdout.log:{"timestamp":1415711950.672348499,"process_id":49734,"source":"vcap.hm9000.analyzer","log_level":"error","message":"Analyzer failed with error - Error:Actual state is not fresh","data":null}
hm9000_analyzer.stdout.log:{"timestamp":1415711950.672435999,"process_id":49734,"source":"vcap.hm9000.analyzer","log_level":"error","message":"Daemon returned an error. Continuining... - Error:Actual state is not fresh","data":null}
hm9000_analyzer.stdout.log:{"timestamp":1415711960.672456503,"process_id":49734,"source":"vcap.hm9000.analyzer","log_level":"error","message":"Store is not fresh - Error:Actual state is not fresh","data":null}

We deleted the etcd data directory and restarted etcd, hm9000, and so on, but the issue is the same.

If we restart both the VM and CF, it recovers, but the issue re-occurs after running for some hours.

Can anybody help check this? How can it be fixed?

why are run logs printed to the syslog file, such as /var/log/messages?

I run on SUSE 11 SP3, and in /var/log/messages I find content like:

Dec 17 15:30:07 szxciscslx16768 vcap.hm9000.listener: {"timestamp":1418801407.709357262,"process_id":6425,"source":"vcap.hm9000.listener","log_level":"info","message":"Received a heartbeat - {"Heartbeats Pending Save":"1"}","data":null}
Dec 17 15:30:08 szxciscslx16768 vcap.hm9000.listener: {"timestamp":1418801408.172760725,"process_id":6425,"source":"vcap.hm9000.listener","log_level":"info","message":"Saving Heartbeats - {"Heartbeats to Save":"1"}","data":null}
Dec 17 15:30:08 szxciscslx16768 vcap.hm9000.listener: {"timestamp":1418801408.175098896,"process_id":6425,"source":"vcap.hm9000.listener","log_level":"info","message":"Bumped freshness","data":null}
Dec 17 15:30:08 szxciscslx16768 vcap.hm9000.listener: {"timestamp":1418801408.175140142,"process_id":6425,"source":"vcap.hm9000.listener","log_level":"info","message":"Saved Heartbeats - {"Duration":"2.290513ms","Heartbeats to Save":"1"}","data":null}

Can I close this?

Failed to handle app.state request - Error:App not found

I do 'cf unmap-route APP DOMAIN', then send 'cf apps' to get app info, but get an incorrect result of '0/1 instances'.

I found error logs as follows in hm9000_api_server.stdout.log:
{"timestamp":1400189188.394666433,"process_id":21551,"source":"vcap.hm9000.api_server","log_level":"error","message":"Failed to handle app.state request - Error:App not found - {"elapsed time":"2.621651ms","payload":"{\"droplet\":\"08d86b67-d3b1-4e21-991a-5add5ba5fc9d\",\"version\":\"f5a8bfb4-561b-4415-8f6a-7d48d8f7964c\"}"}","data":null}

panic: Your test failed

I git cloned the latest code, then ran the tests and got an error:

$ export GOPATH=$HOME/hm-workspace
$ export PATH=$HOME/hm-workspace/bin:$PATH
$ gem install eventmachine -v 0.12.10
$ gem install thin -v 1.4.1
$ gem install nats -v 0.4.28
$ export PATH=$HOME/bin/zookeeper/bin:$HOME/hm-workspace/bin:$PATH

root@wangping-vm02:~/hm-workspace/src/github.com/cloudfoundry/hm9000# ginkgo -r --randomizeAllSpecs --failOnPending --skipMeasurements
[1413482271] Metricsaccountant Suite - 12/12 specs •••••••••••• SUCCESS! 2.43233ms PASS
--- FAIL: TestHM9000 (0.00 seconds)
panic:
Your test failed.
Ginkgo panics to prevent subsequent assertions from running.
Normally Ginkgo rescues this panic so you shouldn't see it.

But, if you make an assertion in a goroutine, Ginkgo can't capture the panic.
To circumvent this, you should call

defer GinkgoRecover()

at the top of the goroutine that caused this panic.
[recovered]
panic:
Your test failed.
Ginkgo panics to prevent subsequent assertions from running.
Normally Ginkgo rescues this panic so you shouldn't see it.

But, if you make an assertion in a goroutine, Ginkgo can't capture the panic.
To circumvent this, you should call

defer GinkgoRecover()

at the top of the goroutine that caused this panic.

goroutine 22 [running]:
runtime.panic(0x6d0720, 0xc208000ae0)
/usr/local/go/src/pkg/runtime/panic.c:279 +0xf5
testing.func·006()
/usr/local/go/src/pkg/testing/testing.go:416 +0x176
runtime.panic(0x6d0720, 0xc208000ae0)
/usr/local/go/src/pkg/runtime/panic.c:248 +0x18d
github.com/onsi/ginkgo.Fail(0xc208003e60, 0x116, 0xc2080010b0, 0x1, 0x1)
/root/hm-workspace/src/github.com/onsi/ginkgo/ginkgo_dsl.go:229 +0xeb
github.com/onsi/gomega/internal/assertion.(*Assertion).match(0xc208042c00, 0x7f58fb796848, 0x9db760, 0x0, 0xc208000e20, 0x1, 0x1, 0x7f58fb60bda8)
/root/hm-workspace/src/github.com/onsi/gomega/internal/assertion/assertion.go:69 +0x324
github.com/onsi/gomega/internal/assertion.(*Assertion).ShouldNot(0xc208042c00, 0x7f58fb796848, 0x9db760, 0xc208000e20, 0x1, 0x1, 0xc208042c00)
/root/hm-workspace/src/github.com/onsi/gomega/internal/assertion/assertion.go:31 +0x9a
github.com/cloudfoundry/storeadapter/storerunner/etcdstorerunner.(*ETCDClusterRunner).start(0xc2080425c0, 0x1)
/root/hm-workspace/src/github.com/cloudfoundry/storeadapter/storerunner/etcdstorerunner/etcd_cluster_runner.go:142 +0xa7d
github.com/cloudfoundry/storeadapter/storerunner/etcdstorerunner.(*ETCDClusterRunner).Start(0xc2080425c0)
/root/hm-workspace/src/github.com/cloudfoundry/storeadapter/storerunner/etcdstorerunner/etcd_cluster_runner.go:40 +0x2c
github.com/cloudfoundry/hm9000/hm_test.TestHM9000(0xc20804e1b0)
/root/hm-workspace/src/github.com/cloudfoundry/hm9000/hm/hm_suite_test.go:18 +0xbd
testing.tRunner(0xc20804e1b0, 0x9d00a0)
/usr/local/go/src/pkg/testing/testing.go:422 +0x8b
created by testing.RunTests
/usr/local/go/src/pkg/testing/testing.go:504 +0x8db

goroutine 16 [chan receive]:
testing.RunTests(0x88bd88, 0x9d00a0, 0x1, 0x1, 0x88bc01)
/usr/local/go/src/pkg/testing/testing.go:505 +0x923
testing.Main(0x88bd88, 0x9d00a0, 0x1, 0x1, 0x9e04e0, 0x0, 0x0, 0x9e04e0, 0x0, 0x0)
/usr/local/go/src/pkg/testing/testing.go:435 +0x84
main.main()
github.com/cloudfoundry/hm9000/hm/_test/_testmain.go:47 +0x9c

goroutine 19 [finalizer wait]:
runtime.park(0x413440, 0x9dac30, 0x9d8e69)
/usr/local/go/src/pkg/runtime/proc.c:1369 +0x89
runtime.parkunlock(0x9dac30, 0x9d8e69)
/usr/local/go/src/pkg/runtime/proc.c:1385 +0x3b
runfinq()
/usr/local/go/src/pkg/runtime/mgc0.c:2644 +0xcf
runtime.goexit()
/usr/local/go/src/pkg/runtime/proc.c:1445

goroutine 20 [syscall]:
os/signal.loop()
/usr/local/go/src/pkg/os/signal/signal_unix.go:21 +0x1e
created by os/signal.init·1
/usr/local/go/src/pkg/os/signal/signal_unix.go:27 +0x32

Ginkgo ran 9 suites in 19.537042993s
Test Suite Failed

my go version: go version go1.3.3 linux/amd64

What's the problem?

github.com/cloudfoundry/gunk/timeprovider doesn't exist in the master branch of cloudfoundry/gunk

dean@dean-Aspire-4740:~/Documents/gopath/src/github.com/cloudfoundry/gunk$ go get github.com/cloudfoundry/hm9000
package github.com/cloudfoundry/gunk/diegonats/testrunner
    imports github.com/cloudfoundry/gunk/diegonats/testrunner
    imports github.com/cloudfoundry/gunk/diegonats/testrunner: cannot find package "github.com/cloudfoundry/gunk/diegonats/testrunner" in any of:
    /usr/lib/go/src/pkg/github.com/cloudfoundry/gunk/diegonats/testrunner (from $GOROOT)
    /home/dean/Documents/gopath/src/github.com/cloudfoundry/gunk/diegonats/testrunner (from $GOPATH)
package github.com/cloudfoundry/gunk/diegonats/testrunner
    imports github.com/cloudfoundry/gunk/timeprovider
    imports github.com/cloudfoundry/gunk/timeprovider
    imports github.com/cloudfoundry/gunk/timeprovider: cannot find package "github.com/cloudfoundry/gunk/timeprovider" in any of:
    /usr/lib/go/src/pkg/github.com/cloudfoundry/gunk/timeprovider (from $GOROOT)
    /home/dean/Documents/gopath/src/github.com/cloudfoundry/gunk/timeprovider (from $GOPATH)
package github.com/cloudfoundry/gunk/diegonats/testrunner
    imports github.com/cloudfoundry/gunk/timeprovider
    imports github.com/cloudfoundry/gunk/timeprovider/faketimeprovider
    imports github.com/cloudfoundry/gunk/timeprovider/faketimeprovider
    imports github.com/cloudfoundry/gunk/timeprovider/faketimeprovider: cannot find package "github.com/cloudfoundry/gunk/timeprovider/faketimeprovider" in any of:
    /usr/lib/go/src/pkg/github.com/cloudfoundry/gunk/timeprovider/faketimeprovider (from $GOROOT)
    /home/dean/Documents/gopath/src/github.com/cloudfoundry/gunk/timeprovider/faketimeprovider (from $GOPATH)

'?' is displayed in a 'instances' column of cf app command result

'?' is displayed in the 'instances' column of the cf app command result if 2 or more HM9000 instances are running. This happens sporadically, not always. I think this is caused by the following:

Only one HM9000 api server runs when 2 or more HM9000 instances are deployed. However, the registrar always runs in each HM9000 instance, so the api server routes of every instance are registered with the Go Router. As a result, there is a possibility that bulk api requests are routed to a stopped api server.
