Rackspace Cloud Monitoring Agent

Home Page: http://virgoagent.com/rackspace-monitoring-agent/

License: Apache License 2.0

rackspace-monitoring-agent's Introduction

Rackspace Monitoring Agent

The monitoring agent is the first agent to use the infrastructure provided by virgo-base-agent.

Installing The Agent

Make sure the packages required for building are installed on your system. The Dockerfile lists the development dependencies.

Please note that we provide binaries for many platforms. See the article "Installing the Monitoring Agent" for instructions.

Otherwise, continue reading this section.

Satisfy pre-requisites:

If you're on Windows, you may first have to install certain utilities, or find them and add them to your path. These are:

cmake - Downloadable from the CMake website
nmake - Included in Visual Studio/VC/bin but may need to be added to your path
signtool - Included in Microsoft SDKs/windows/v7.1a/bin but may need to be added to your path

On Linux from a fresh install:

apt-get install make cmake

Get the source:

git clone https://github.com/virgo-agent-toolkit/rackspace-monitoring-agent

Go into the directory that you just created:

cd rackspace-monitoring-agent

Build:

make

Finally, install the agent:

make install

After installing on Unix systems, there is a new binary called rackspace-monitoring-agent. To get the agent running on your system, please follow the documented setup procedure.

Host Info Runner

The agent has a built-in host information runner (similar to Ohai).

rackspace-monitoring-agent -e hostinfo_runner -x [type]

Further documentation for the host info types can be found in the hostinfo readme.

License

The Monitoring Agent is distributed under the Apache License 2.0.

Building for Rackspace Cloud Monitoring

Rackspace customers: Virgo is the open source project for the Rackspace Cloud Monitoring agent. Feel free to build your own copy from this source.

But! Please don't contact Rackspace Support about issues you encounter with your custom build.

Versioning

The agent is versioned with a three-part, dot-separated "semantic version" of the form x.y.z, for example 1.4.2. The rough meaning of each part is:

major version numbers change when we make a backwards-incompatible change to the bundle format. Binaries can only run bundles with an identical major version number, e.g. a binary of version 2.3.1 can only run bundles whose version starts with 2.
minor version numbers change when we make backwards-compatible changes to the bundle format. Binaries can only run bundles whose minor version is greater than or equal to the binary's minor version, e.g. a binary of version 2.3.1 can run a 2.3.4 bundle but not a 2.2.1 bundle.
patch version numbers change every time a new bundle is released. They carry no semantic meaning.
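The compatibility rule can be sketched in a few lines of Python. This is only an illustration based on the examples given above (2.3.1 runs 2.3.4 but not 2.2.1), not the agent's actual implementation; the function names are hypothetical, and plain x.y.z strings without build suffixes are assumed.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a plain "x.y.z" string into integers (no build suffix assumed)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def can_run(binary_version: str, bundle_version: str) -> bool:
    """Hypothetical check: may this binary run this bundle?"""
    bin_major, bin_minor, _ = parse_version(binary_version)
    bun_major, bun_minor, _ = parse_version(bundle_version)
    # Major versions must match exactly; the bundle's minor version must
    # not be older than the binary's. The patch number is ignored.
    return bun_major == bin_major and bun_minor >= bin_minor
```

For example, can_run("2.3.1", "2.3.4") is true, while can_run("2.3.1", "2.2.1") is false.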

Running tests

Virgo supplies infrastructure for running tests. Calling make test will launch Virgo with command line flags set to feed it the testing bundle and with the -e flag set to tests.

make test

You can also run an individual test module:

TEST_MODULE=net make test

Running tests on docker

This only needs to be done once per terminal session:

docker-machine create agent
eval $(docker-machine env agent)

Use docker-compose to build and run the tests:

docker-compose run build make clean
docker-compose run build make
docker-compose run build test

Configuration File Parameters

monitoring_token [token]         - (required) The authentication token.
monitoring_id [agent_id]         - (optional) The Agent's monitoring_id
                                   (default: Instance ID (Xen) or Cloud-Init ID)
monitoring_snet_region [region]  - (optional) Enable Service Net endpoints
                                   (region: dfw, ord, lon, syd, hkg, iad)
monitoring_endpoints             - (optional) Force IP and Port, comma
                                   delimited
monitoring_proxy_url [url]       - (optional) Use an HTTP proxy.
                                   Must support CONNECT on port 443.
                                   Additionally, HTTP_PROXY and HTTPS_PROXY
                                   are honored.
monitoring_query_endpoints [queries] - (optional) SRV queries comma
                                        delimited
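As a rough illustration of how these parameters fit together, here is a Python sketch that parses and validates a configuration file. It assumes one whitespace-separated "key value" pair per line and "#" comments, which is an assumption about the file format rather than something stated above; the function names and the sample token are hypothetical.

```python
from typing import Dict, List

# Parameter names taken from the list above.
KNOWN_KEYS = {
    "monitoring_token",
    "monitoring_id",
    "monitoring_snet_region",
    "monitoring_endpoints",
    "monitoring_proxy_url",
    "monitoring_query_endpoints",
}


def parse_config(text: str) -> Dict[str, str]:
    """Parse "key value" lines (assumed format) into a dictionary."""
    config: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comments
        parts = line.split(None, 1)
        config[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return config


def validate(config: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    if not config.get("monitoring_token"):
        problems.append("monitoring_token is required")
    for key in sorted(config):
        if key not in KNOWN_KEYS:
            problems.append("unknown parameter: " + key)
    return problems
```

A file containing "monitoring_token abc123" and "monitoring_snet_region dfw" on separate lines would parse into two entries and pass validation.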

Exit Codes

1 unknown error
2 config file fail
3 already running
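A wrapper script could translate these codes into messages along the following lines. This Python sketch is illustrative only; the function name is hypothetical, and treating 0 as success is the usual process convention rather than something listed above.

```python
# Exit codes as documented above.
EXIT_CODES = {
    1: "unknown error",
    2: "config file fail",
    3: "already running",
}


def describe_exit_code(code: int) -> str:
    """Map an agent exit code to a human-readable description."""
    if code == 0:
        return "success"  # assumption: standard convention, not documented above
    return EXIT_CODES.get(code, "undocumented exit code %d" % code)
```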

Signals

SIGUSR1: Force GC
SIGUSR2: Toggle Debug
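On Unix systems these signals can be sent with kill, or from a script. The following Python sketch shows one way to do it; the helper name and the action labels are hypothetical, and the caller is assumed to already know the agent's process ID.

```python
import os
import signal

# Signal meanings as documented above (Unix only).
AGENT_SIGNALS = {
    "force_gc": signal.SIGUSR1,
    "toggle_debug": signal.SIGUSR2,
}


def signal_agent(pid: int, action: str) -> None:
    """Send the signal for the given action to a running agent process."""
    os.kill(pid, AGENT_SIGNALS[action])
```

For example, signal_agent(agent_pid, "toggle_debug") would toggle debug logging on the process with that ID.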

rackspace-monitoring-agent's Issues

Make sure we bundle license as part of the agent distribution

Document: Server-side configuration

Create documentation in the README for server-side configuration.

Explain features, attributes, parameters, and requirements for server-side configuration.

Hackday idea: Create an End-to-End Solution to push metrics to Blueflood

Verify the plugin timeout change triggers an update to the check

Report Host Info Types that are only supported by the Agent on the particular OS

We have a thought that the list of host info types reported by the agent to the AEP should only be the types supported by this agent on this OS. Discuss.....

[FREEBSD] gmake instead of make

README is invalid for FreeBSD 10 installation. You need to go with gmake instead of make to properly start the build.

Agent update to support multi-tenant metrics submission

Emit agent health stats to the AEP

We should emit metrics from each agent through the AEPs to do extra monitoring:
memory usage
cpu time
process age
something else

Improve Virgo-Logging

Write Unit Tests for HUP support (logrotation)
Write StdoutFileWriter

https://github.com/rphillips/virgo-logging

Documentation for agent project

Goal: Create a website for Virgo Agent Project to explain what this project is about, how it can be valuable for other agent writers, and how to contribute to the project.

The target audience for the documentation should be developers who wish to learn about the project, use it to build other agents, or contribute to the project. This would not include documentation on how the features are surfaced through any particular product, like Rackspace Cloud Monitoring. That documentation should be managed by the vendors themselves.

Repo for doc: https://github.com/virgo-agent-toolkit/docs

Suggested Layout:
A. Documentation

Introduction (getting started)
Advanced topics (Troubleshooting and debugging)
Automation (how to use automation with the agent)

B. Version Information (see current and past version info)

Release notes

C. Resources

Link to List of Agent Check Types on Rackspace Documentation - documentation SPECIFIC to the project and its developers here.
* API docs (arguments, return values) of functions that live in the code base, including checks.
* Differences or 'features' that the rackspace-monitoring-agent adds to virgo
* Common pitfalls/gotchas while developing a new check type
* Generic troubleshooting for connecting and using versions of rackspace-monitoring-agent that
List of Agent Plug-ins (not supported)
Link to other agent related features, like the capability to pick up configuration files and push to the server

D. Get Involved

How to Add example configs and YAML files
How to Improve Documentation
How to Add a plug-in

E. Developer Resources

Architecture Diagram
Plug-in Architecture
Coding style
About Luvit (likely links to other resources)
About Lua (likely links to other resources)

Inspiration: https://collectd.org/wiki/index.php/Main_Page

Agent checks should all support a common method for returning their metadata

Currently agent checks don't make it easy to get the metrics out with type and unit information. This metadata is stored in various ways and there's no way to programmatically extract it from all checks afaik.
https://github.com/racker/virgo/blob/master/check/apache.lua#L97
https://github.com/racker/virgo/blob/master/check/disk.lua#L59
It would be cool if all checks supported a method to return the metric metadata in a consistent format.
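One possible shape for such a method, sketched in Python rather than the agent's Lua: each check declares its metrics in one table, and a shared base method returns them in a uniform format. All class, method, and metric names here are hypothetical and do not reflect the existing check code.

```python
from typing import Dict, NamedTuple


class MetricInfo(NamedTuple):
    type: str
    unit: str


class BaseCheck:
    # Each check declares its metrics once, in a single table.
    METRICS: Dict[str, MetricInfo] = {}

    @classmethod
    def get_metric_metadata(cls) -> Dict[str, Dict[str, str]]:
        """Return {metric_name: {"type": ..., "unit": ...}} for this check."""
        return {name: dict(info._asdict()) for name, info in cls.METRICS.items()}


class DiskCheck(BaseCheck):
    # Illustrative metrics only; not the real disk check's metric list.
    METRICS = {
        "read_bytes": MetricInfo("uint64", "bytes"),
        "reads": MetricInfo("uint64", "count"),
    }
```

With this layout, tooling could call get_metric_metadata() on every check class without knowing how each check stores its metrics internally.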

Windows Agent Configuration GUI

Create a GUI for Windows to perform the same task as the --setup option when running the agent from the command-line.

Ideally this can be accessed from a Start Menu item if reconfiguration is needed and will be run during the MSI installer.

example on how to use the agent.apache check

Hi,

Trying to find any documentation but ... none found. Am I missing it?

Fallback to ServiceNet for Rackspace Hosts

In order to ease configuration and help customers who do not have access to the public Internet, we could fall back to the service net endpoints for API and AEP.

Initial thoughts are to use the xenstore vm-data to determine if the agent is on a Rax cloud host.

Upgrades say they begin a new log file, but then fail to start

Refactoring lua files

TL;DR: the division of code in virgo has some design issues to address. I'm gonna proceed with a slightly hacky approach for now, but we need to think about the design for the long run.

Currently all lua scripts, monitoring-related or not, are in virgo instead of virgo-base. This is fine for people who just wanna deploy their own monitoring system, but not good for people who wanna develop their own agent based on virgo-base, e.g., a cloudkeep (https://github.com/cloudkeep) agent that uses FUSE to provide keys to applications. The lua files in virgo need to be divided into a monitoring-related part (which stays in virgo) and a general part (which should be moved into virgo-base).

Here's a diagram I drew while reading:

Red nodes are the ones with monitoring logic. Some of them are easy to deal with, like /check/*.lua, /schedule/scheduler.lua, since they are independent modules that can be simply isolated from the core part. The challenging part is the MonitoringAgent -> ConnectionStream -> AgentClient -> AgentProtocolConnection path. Each of them more or less has something to do with monitoring.

@robert-chiniquy and I thought about providing a base type for each of them, which has core logic such as handshake.hello (authentication) and heartbeat, and use dependency injection to determine what type to use. When users need to extend something, they simply sub-class the base type and inject the type to proper place.

However, the idea above has issues since there are too many levels. Suppose we need to add a new RPC call, helloworld, which uses a new message type helloworld.param. The message type is not a big issue. But we also need to add the handler for helloworld RPC method, which is defined in /protocol/connection.lua. However, it's in the bottom level. In order to support that, the type needs to be specified somehow from top down.

For now, we are gonna add an argument to constructors, which holds the types that need to be overridden. Then when instantiating objects, if a type is defined for that object, use that type; otherwise, use the default one. But this is not a good design in the long term. We probably still need to refactor those scripts in the future for a cleaner design.
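The interim approach described above (a constructor argument carrying type overrides that is passed down the chain) might look roughly like this. The sketch is in Python rather than Lua, uses simplified stand-ins for the real classes, and the "types" argument name and method names are hypothetical.

```python
class ProtocolConnection:
    """Bottom level: knows which RPC methods it can handle."""

    def rpc_methods(self):
        return ["handshake.hello", "heartbeat"]


class AgentClient:
    def __init__(self, types=None):
        types = types or {}
        # Use the overriding type if one was supplied, else the default.
        connection_type = types.get("ProtocolConnection", ProtocolConnection)
        self.connection = connection_type()


class ConnectionStream:
    def __init__(self, types=None):
        self.types = types or {}
        client_type = self.types.get("AgentClient", AgentClient)
        # The override table is handed down so lower levels can consult it.
        self.client = client_type(types=self.types)


class HelloWorldConnection(ProtocolConnection):
    """A user extension adding the hypothetical helloworld RPC."""

    def rpc_methods(self):
        return super().rpc_methods() + ["helloworld"]
```

Constructing ConnectionStream(types={"ProtocolConnection": HelloWorldConnection}) swaps in the extended bottom-level type without touching the intermediate classes, which also shows the drawback: every level has to forward the table.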

Add open files to the filesystem check

number of total files open on the system.

Resurrect statsD support

https://github.com/virgo-agent-toolkit/rackspace-monitoring-agent/pull/459/files

Migrate Master to new Luvit 2.0

generated uuid has dangling -

I got a uuid that looks like this: f5a39211-3b09-4103-c1a8-

@rphillips ideas?
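Until the root cause is found, a format check could at least catch truncated values like the one above. A minimal Python sketch, assuming the usual 8-4-4-4-12 hexadecimal layout; the function name is hypothetical.

```python
import re

# Canonical textual UUID layout: 8-4-4-4-12 hexadecimal digits.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def is_complete_uuid(value: str) -> bool:
    """Return True only if the whole string is a full-length UUID."""
    return UUID_PATTERN.fullmatch(value.lower()) is not None
```

The value reported above, f5a39211-3b09-4103-c1a8-, fails this check because the final 12-digit group is missing.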

create test.zip and run it as an entry point

Instead of using an external luvit binary to run tests zip up the tests and run via the entry point method.

Uploading crash dump fails with "socket hang up" or "ECANCELED" error

This is a second issue from #489.

I probably wasn't obvious and clear enough in my original pull request description, but there are actually two issues in play here. First one is logging issue which has been fixed in #489 and the second one is the actual crash dump upload issue which I haven't been able to track down yet.

Every time I restart the agent it tries to upload some crash dump files, but it fails with either a "socket hang up" or an "ECANCELED" error.

Second (ECANCELED) error seems to be related to the retried request.

Mon Jan  6 18:42:05 2014 WRN: POST to nil:nil failed for /var/lib/rackspace-monitoring-agent/rackspace-monitoring-agent-crash-report-ponies.dmp with status: ? and error: socket hang up.
Mon Jan  6 18:42:05 2014 DBG: retrying download 1 more times.
Mon Jan  6 18:42:05 2014 WRN: POST to nil:nil failed for /var/lib/rackspace-monitoring-agent/rackspace-monitoring-agent-crash-report-ponies.dmp with status: ? and error: ECANCELED, operation canceled.

If I manually try to send a crash dump from that server using cURL it works just fine.

Could it be some weird socket pooling / re-use issue going on in Luvit since it posts crash dump to the same IP address which is also used to talk to the agent endpoint?

Edit: Here is a screenshot from a tcp dump capture. All those RST's at the end look kinda weird... Might be related?

Reduce logging for non-debug mode

The non-debug logging level prints information regularly. Reduce this to just errors and warnings.

Move the host info logic into a subprocess

When constructing these objects, sigar is called to populate the object. This should actually be populated within the run function to allow everything to be async. A better fix would be to run these in a subprocess.

Simplify Handshake Message

In #636 and https://github.com/racker/ele/pull/2630 we've been talking about expanding the handshake message to better accommodate the new and growing feature list field. The suggestions are tending toward simplifying the handshake and breaking out the features.

The current handshake (handshake.hello), https://github.com/virgo-agent-toolkit/virgo-base-agent/blob/master/libs/connection.lua#L262, contains the auth token, monitoring id, agent (application) name, versions, and the feature set.

We could reduce the handshake down (again) to more of a plain authenticate if we move the monitoring id and the features into a secondary message (something akin to handshake.identify). This way the hello authenticates and versions the connection while the identify binds and sets up features for this agent.

create state directory and add uuid

/var/run/{agent}/uuid I guess?

Remove Virgo.config

Move the config logic into lua land. Right now it's in C and exported as virgo.config. This does not fit in the newer model.

The task is to read config files in lua and make them available within virgo-base.

Improve the prompt about entity creation

Now that Cloud Intelligence is fully functional, should we change the following prompt (the line after "Select Option") in agent setup to mention it?

In order to execute checks, the agent must be associated with a Cloud Monitoring Entity.

Please select the Entity that corresponds to this server:
  1. Create a new Entity for this server (not supported by Rackspace Cloud Control Panel)
  2. Do not associate with an Entity

Select Option (e.g., 1, 2): 1
Creating an entity does not work with the Rackspace Cloud Control Panel. Really create an entity? (yes/no) yes

ReadMe update

This is for the README update. The website update is discussed here: #551

Topics to address:
A. What is (the agent)
B. Key Concepts
C. How to Install and Configure
D. How to use it (the agent)
E. Supported OSes
F. Sub-section of specific topics link to:

Check Types and their metrics
Plug-ins

G. Debugging and Troubleshooting
H. Tests
I. Contacts (e.g. IRC and mailing list)

Inspiration:
https://github.com/etsy/statsd

monitoring.zip opened multiple times

I need to look into this. dtruss shows monitoring.zip opened multiple times. Should just require holding state within the module loader.

Tests related files should not be bundled into deployment

Currently both rackspace-monitoring-agent and the make test target use the same bundle (rackspace-monitoring-agent-bundle.zip) built from the make_bundle target. It would be nice to have a test bundle target just for testing, which contains test-related files, and a release bundle that only has the lua scripts required to run. This way we can keep the released bundle clean and minimal.

Add a hostinfo endpoint which tracks out of date packages

This could show all apt-get packages (on Ubuntu) which have un-applied updates. Even cooler would be if it could track which stream they come from (i.e. security vs. general updates).

remove all.gyp?

When I was playing with ninja it was complaining that "all" was an ambiguous target. Paul mentioned virgo has a all.gyp which is a bit weird anyways. Perhaps we are doing something wrong with gyp?

fix updated agent.plugins parameters remove the old check

Add the ability to use HostInfo data as a Check

There is a wide variety of data available using the HostInfo system within the agent. People have expressed interest in using HostInfo data as check data.

Pros:

HostInfo data can then be treated as metrics and stored long-term
Changes to HostInfo data could then be alarmed on

Cons:

Some HostInfo data does not map well to a check format

--prefix is completely wrong

Using

./configure --prefix=/usr

causes the make install to fail. Using a normal

make install

with no DESTDIR set works just fine, but for some reason the Makefile wants to put the prefix before the DESTDIR, so running

make DESTDIR=/home/wgiokas/pkg/virgo-git install

when the prefix is set to /usr causes it to try

install -d /usr//home/wgiokas/pkg/virgo-git//usr/bin

Also, when the prefix is set to / and I run the same make install command as before, it still tries to install to

///home/wgiokas/pkg/virgo-git//usr/bin

To conclude (tl;dr):

what the help says (it defaults to a prefix of /usr, not /usr/local)
the prefix should be after the DESTDIR
the prefix should overwrite whatever is setting /usr

Thank you,
kaictl
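For reference, the conventional composition is DESTDIR first, then the prefix, then the subdirectory. The Python sketch below only illustrates that ordering with the paths from the report; it is not the project's Makefile logic, and the function name is hypothetical.

```python
import posixpath


def staged_path(destdir: str, prefix: str, subdir: str = "bin") -> str:
    """Compose an install directory as DESTDIR + prefix + subdir."""
    target = posixpath.join(prefix, subdir)  # e.g. "/usr/bin"
    if not destdir:
        return posixpath.normpath(target)
    # DESTDIR goes in front of the prefix, with duplicate slashes removed.
    return posixpath.normpath(destdir.rstrip("/") + "/" + target.lstrip("/"))
```

With DESTDIR=/home/wgiokas/pkg/virgo-git and a prefix of /usr this yields /home/wgiokas/pkg/virgo-git/usr/bin, and with no DESTDIR it yields /usr/bin.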

Add support for weak timers

The GC support shouldn't block the eventloop if there are not any pending events.

Upgrades - Tweak the version string to include only the base version

It should be centos-5-x86_64

/exe/1.1.0-27/centos-5.11-x86_64-rackspace-monitoring-agent-1.1.0-27.sig
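The requested trimming could be sketched as below, in Python for illustration. It assumes the platform string always has the form distro-version-arch, which is inferred from the single example above; the function name is hypothetical.

```python
import re


def base_platform(platform: str) -> str:
    """Drop the minor part of the OS version in "<distro>-<version>-<arch>"."""
    # Keep "<distro>-<major>" and remove any ".N" groups before the next "-".
    return re.sub(r"^([a-z]+-\d+)(?:\.\d+)+(?=-)", r"\1", platform)
```

Applied to centos-5.11-x86_64 this returns centos-5-x86_64; a string that already has only a major version is returned unchanged.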

Add custom plugins to targets API.

The agent should be able to report which custom plugin files are available for it to run. This would be a list of executable files in /usr/lib/rackspace-monitoring-agent/plugins
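Collecting that list is straightforward. A Python sketch of the idea follows (the agent itself is written in Lua, and the function name is hypothetical); only regular files with an execute bit are reported.

```python
import os
from typing import List

PLUGIN_DIR = "/usr/lib/rackspace-monitoring-agent/plugins"


def list_custom_plugins(plugin_dir: str = PLUGIN_DIR) -> List[str]:
    """Return the names of executable regular files in the plugin directory."""
    if not os.path.isdir(plugin_dir):
        return []
    plugins = []
    for name in sorted(os.listdir(plugin_dir)):
        path = os.path.join(plugin_dir, name)
        # Skip subdirectories and files without an execute permission.
        if os.path.isfile(path) and os.access(path, os.X_OK):
            plugins.append(name)
    return plugins
```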

Create Apache VHost Check

The Apache check does not display the status of the virtual hosts. http://blog.e-shell.org/132
An alternative would be to enhance the apache check to have the virtual hosts status. The risk is the variation of the number of metrics, and how to map the host name as part of the metrics locator.

Limit a maximum number of established connections per endpoint

Today I encountered a condition where the agent had more than 500 established connections to the London endpoint.

We should add some kind of sanity check to the agent to make sure there is at most 1 established connection per endpoint.

On a side note, we should also put a limit inside the endpoint.
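One way to enforce the agent-side limit is to route all connection attempts through a registry keyed by endpoint. This Python sketch only illustrates the idea; the class and method names are hypothetical, and the connect callable stands in for whatever actually opens a connection.

```python
class ConnectionRegistry:
    """Hands out at most one live connection per endpoint."""

    def __init__(self, connect):
        self._connect = connect  # callable: endpoint -> connection object
        self._connections = {}

    def get(self, endpoint):
        """Return the existing connection, opening one only if none exists."""
        connection = self._connections.get(endpoint)
        if connection is None:
            connection = self._connect(endpoint)
            self._connections[endpoint] = connection
        return connection

    def drop(self, endpoint):
        """Forget a closed or failed connection so a new one may be opened."""
        self._connections.pop(endpoint, None)
```

Reconnect logic would call drop() before asking for a new connection, so repeated retries cannot accumulate hundreds of sockets to the same endpoint.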

[FREEBSD] Build fails because of GCC dependency

The FreeBSD 10 release has ditched GCC (completely - it's not even in the base system anymore) in favour of LLVM/Clang. Because of that, the build on this target OS fails with this error (after the build has been running for some time):

--- CUT ---
COPY /opt/virgo-0.1.9/out/Debug/jit
ACTION _opt_virgo_0_1_9_base_bundle_gyp_bundle_h_target_bundle /opt/virgo-0.1.9/out/Debug/obj/gen/bundle.h
TOUCH /opt/virgo-0.1.9/out/Debug/obj.target/base/bundle.zip.embed.stamp
TOUCH /opt/virgo-0.1.9/out/Debug/obj.target/base/bundle.zip.stamp
AR(target) /opt/virgo-0.1.9/out/Debug/obj.target/base/deps/libarchive.a
AR(target) /opt/virgo-0.1.9/out/Debug/obj.target/base/deps/liblua_sigar.a
LINK(target) /opt/virgo-0.1.9/out/Debug/minilua
lockf: g++: No such file or directory
gmake[1]: *** [/opt/virgo-0.1.9/out/Debug/minilua] Error 1
gmake[1]: *** Waiting for unfinished jobs....
gmake[1]: Leaving directory `/opt/virgo-0.1.9/out'
gmake: *** [all] Error 2
--- CUT ---

Installing GCC is not a proper solution. There is a reason why the FreeBSD 10 release ditched GCC, and there is a reason why you get a whole pile of warnings while trying to install GCC on this distro from ports.

The monitoring agent should have a way to automatically install a given version of a plugin from some authoritative source

Currently there are two main challenges with agent checks and plugins

There is no authoritative source other than the plugin-contrib repo
We would like to build more agent checks but that means the agent will only become bigger

Add process check

We should add a native agent check that inspects resource use by a program (1 or more processes).
It should:

Take a regex (or similar pattern) that can be used to match a command as a parameter to the check
The check should find all processes matching the pattern supplied, and aggregate metrics across them (ie, we don't report metrics for each individual process).
In terms of metrics, we should ideally report at least CPU and memory use, as well as how many processes were aggregated.
Let's try to make this work on Windows too.
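The aggregation step in the list above could look like the following Python sketch. It works on already collected process samples so that it stays independent of how the data is gathered on each OS; the record fields, metric names, and function name are all hypothetical.

```python
import re
from typing import Dict, Iterable, NamedTuple


class ProcessSample(NamedTuple):
    command: str
    cpu_percent: float
    memory_bytes: int


def aggregate_processes(pattern: str, samples: Iterable[ProcessSample]) -> Dict[str, float]:
    """Sum CPU and memory over all processes whose command matches pattern."""
    matcher = re.compile(pattern)
    matched = [sample for sample in samples if matcher.search(sample.command)]
    # One set of metrics for the whole group, not one per process.
    return {
        "process_count": len(matched),
        "cpu_percent": sum(sample.cpu_percent for sample in matched),
        "memory_bytes": sum(sample.memory_bytes for sample in matched),
    }
```

Given two matching worker processes and one unrelated process, the result reports a process_count of 2 with the CPU and memory figures summed over the two workers only.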

Does it make sense to add a "slow-query" info for agent with mysql check configured?

This is just an idea at this point.

http://www.rackspace.com/knowledge_center/article/access-slow-query-and-general-logs-for-cloud-databases
