
zmon-worker's Introduction

ZMON source code on GitHub is no longer in active development. Zalando will no longer actively review issues or merge pull-requests.

ZMON is still being used at Zalando and serves us well for many purposes. We are now deeper into our observability journey and understand better that we need other telemetry sources and tools to elevate our understanding of the systems we operate. We support the OpenTelemetry initiative and recommend that others starting their journey begin there.

If members of the community are interested in continuing to develop ZMON, consider forking it. Please review the license before you do.

ZMON Worker

Build Status Coverage Status Latest PyPI version OpenTracing enabled License

ZMON's Python worker does the heavy lifting: it executes tasks against entities and evaluates all alerts assigned to each check. Tasks are picked up from Redis, and the resulting check values plus alert-state changes are written back to Redis.

Local Development

Start Redis on localhost:6379:

$ docker run -p 6379:6379 -it redis

Install the required development libraries:

Ubuntu/Debian:

$ sudo apt-get install build-essential python2.7-dev libpq-dev libldap2-dev libsasl2-dev libsnappy-dev libev4 libev-dev freetds-dev
$ sudo pip2 install -r requirements.txt

macOS:

$ brew install python snappy
$ sudo pip install -r requirements.txt

Start the ZMON worker process:

$ python2 -m zmon_worker_monitor

You can query the worker monitor via the REST API:

$ curl http://localhost:8080/status

You can also query the worker monitor via RPC:

$ python2 -m zmon_worker_monitor.rpc_client http://localhost:23500/zmon_rpc list_stats

Running Unit Tests

Run tests via Tox.

$ tox

You can also pass arguments to pytest via Tox, for instance to run a specific test case:

$ tox tests/test_kairosdb.py::test_kairosdb_query

An alternative way of running the unit tests is within Docker:

$ export WORKER_IMAGE=registry.opensource.zalan.do/stups/zmon-worker:cd166
$ docker run -it -u $(id -u) -v $(pwd):/workdir -w /workdir $WORKER_IMAGE python setup.py flake8
$ docker run -it -u $(id -u) -v $(pwd):/workdir -w /workdir $WORKER_IMAGE python setup.py test

Building the Docker Image

$ docker build -t zmon-worker .
$ docker run -it zmon-worker

Running the Docker image

The Docker image supports many configuration options via environment variables. Configuration options are explained in the ZMON Documentation.

zmon-worker's People

Contributors

a1exsh, aermakov-zalando, alexeyklyukin, anton-ryzhov, avaczi, bkecskemeti, csenol, cvirus, dneuhaeuser-zalando, drummerwolli, gargravarr, heroldus, hjacobs, jan-m, lerovitch, lfroment0, lorenzhawkes, losbossos, mohabusama, mroderick, mtesseract, olevchyk, pitr, porrl, prayerslayer, szuecs, tkrop, twz123, vetinari, whiskeysierra


zmon-worker's Issues

Add DNS wrapper

Right now resolve is added to the TCP wrapper, although it is not really related to TCP.

Adding a dedicated DNS wrapper would make more sense here.
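A minimal sketch of what such a standalone wrapper could look like (the `DnsWrapper` class and `dns().resolve()` usage are hypothetical names, not an existing ZMON API):

```python
import socket

class DnsWrapper(object):
    """Sketch of a standalone DNS check wrapper (hypothetical API)."""

    def __init__(self, host='localhost'):
        self.host = host

    def resolve(self, hostname=None):
        """Resolve a hostname to its IPv4 address."""
        return socket.gethostbyname(hostname or self.host)

# A check command might then look like:
# dns('example.org').resolve()
```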

Allow more control over email body

Ideally I could pick from multiple templates, but for now it would be okay to just pass some flags like include_value, include_definition, etc. to the default template(s).

Allow querying CloudWatch without additional "list_metrics" call

We can optimize the CloudWatch wrapper (and reduce the probability of running into AWS rate limits) by allowing it to be used without the "list_metrics" call:

Introduce a new method (e.g. "query_one") that directly calls get_metric_statistics when all parameters are known.

Create /health http endpoint in master process

We want to create a /health endpoint in our master CherryPy process that reflects the status of the system.

Background:
The master process, which spawns all the workers, contains a CherryPy HTTP server and an RPC server for internal communication with its child processes.
Each child worker process has a main thread, which runs the ZMON checks, and a Reactor thread, which reacts to special circumstances and reports them to the master via RPC calls. Currently the Reactor thread's only functionality is detecting when the main thread is stuck in a long-running check and triggering an RPC call for the master process to terminate that child worker.
We want to expand the Reactor thread to periodically report its health status to the master process. The master process will aggregate the health feedback it receives from all child workers so that it can be presented via an HTTP endpoint that reflects when the whole system is malfunctioning.

Proposed specs:

endpoint: /health
return:

  • 200 OK: System healthy
  • 503 Service Unavailable: System unhealthy

Criteria for unhealthy system:

  • If n/2 + 1 worker processes are not responding, i.e. they have stopped contacting the RPC server.
  • If n/2 + 1 worker processes were killed (because they got stuck) or died from unknown causes within some unit of time (30 min?)

what else...?
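The majority criterion above can be sketched as follows (function names are illustrative, not the actual master-process API):

```python
def is_healthy(total_workers, unresponsive_workers):
    """Return False once a majority (n/2 + 1) of workers are unresponsive."""
    majority = total_workers // 2 + 1
    return unresponsive_workers < majority

def health_status(total_workers, unresponsive_workers):
    """Map the aggregated worker health to the proposed HTTP status codes."""
    return 200 if is_healthy(total_workers, unresponsive_workers) else 503
```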

Consider limiting the size of a single check result (total bytes and number of keys)

The size of check results is currently unbounded; this leads to problems where users (mostly accidentally) generate megabytes (!) of result data for a single check. As the data is stored in JSON format in Redis (and additionally in KairosDB), we might run into memory issues (e.g. Redis memory fragmentation and total database size).

A simple and effective approach would be to introduce a reasonable (configurable) maximum size for check results, say 64 KiB.
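A sketch of such a guard, assuming the 64 KiB default suggested above (the function name and error handling are illustrative, not the actual worker API):

```python
import json

MAX_RESULT_BYTES = 64 * 1024  # suggested configurable default

def check_result_size(result, limit=MAX_RESULT_BYTES):
    """Raise if the JSON-serialized check result exceeds the limit."""
    size = len(json.dumps(result).encode('utf-8'))
    if size > limit:
        raise ValueError('check result too large: {} bytes (limit {})'.format(size, limit))
    return result
```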

Fix flaky unit test

The main worker test (using multiple processes) sometimes fails:

        for string in expected_strings:
>           assert string in data['zmon:checks:123:77']
E           TypeError: 'NoneType' object has no attribute '__getitem__'

Consider moving SNMP and Nagios plugins to "extra" plugins

I think we should have a clean set of "core" plugins which are 100% supported and unit tested:

  • HTTP
  • Time
  • ZMON
  • PostgreSQL
  • ..

Some plugins such as SNMP and Nagios are currently not very useful and should move to a new "extra" plugin section. The "extra" plugins should be located in the same git repo, but they should only be loaded on-demand by setting an environment variable.

Benefits:

  • Clean separation of 100% supported plugins and "legacy" stuff
  • We can exclude the "extra" plugins from code coverage as we will probably never write a full test suite for them
  • Startup and test time is faster as less code is loaded

HTTP wrapper response object

In certain cases, returning the response object could be needed. One use case is a REST API with pagination headers (Link): both the response JSON and the headers are required to complete the check, and the HEAD method cannot be relied upon to return the Link headers or a payload with pagination links.

The suggestion is to either return the requests.Response object or a simplified ZmonHTTPResponse with a fixed set of properties (headers, json(), status_code, text, ok).
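A sketch of the simplified variant (ZmonHTTPResponse is the name proposed in the issue; the fields mirror requests.Response but this is not an existing class):

```python
import json

class ZmonHTTPResponse(object):
    """Sketch of a simplified response object with the proposed fixed properties."""

    def __init__(self, status_code, headers, text):
        self.status_code = status_code
        self.headers = headers
        self.text = text

    @property
    def ok(self):
        """True for non-error status codes, mirroring requests.Response.ok."""
        return 200 <= self.status_code < 400

    def json(self):
        """Parse the body as JSON."""
        return json.loads(self.text)
```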

Fix EventLog

We still use the file-based eventlog Python module, which does not work properly in a Docker context (files are written within the Docker container's filesystem).

Add support for custom config variables

Could be useful for supplying special variables that are accessible to all check commands. One use case is authorization tokens that the http wrapper can use to make authorized requests.

Suggestion:

Store in a dict in config

A new command, e.g. secrets() or vars(), etc.

Example usage

vars('my_service_token')
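A sketch of the proposed command (the function name, the backing dict, and all values below are illustrative; in practice the dict would come from worker configuration):

```python
# Illustrative values only; in the worker this would be loaded from config
CONFIG_VARS = {'my_service_token': 'abc123'}

def config_var(name, default=None):
    """Look up a custom config variable by name."""
    return CONFIG_VARS.get(name, default)

# Hypothetical usage inside a check, e.g. to authorize an HTTP request:
# http('https://example.org/api',
#      headers={'Authorization': 'Bearer ' + config_var('my_service_token')})
```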

Capture is not working correctly for HipChat notifications.

It seems that the {{}}-pattern substitution for captures does not work for HipChat notifications if the message is given explicitly. E.g. in XXX, the notification produces

ALERT ENDED: Balance AWS: Business Partner Service Not Found Responses in Last Hour ({details}) on ad-app-tier-business-xxx-service-test-463[aws:xxx:eu-central-1]

instead of

ALERT ENDED: Balance AWS: Business Partner Service Not Found Responses in Last Hour ({details}) on ad-app-tier-business-xxx-service-481[aws:xxx:eu-central-1]

ping() does not work in Docker image

The check command ping() returns "[Errno 2] No such file or directory"
on our systems, but also on demo.zmon.io.

Currently I only have a single ping check with the following content:
ping()

cloudwatch scraping may fail with wildcard dimensions

The check

cloudwatch().query({'AvailabilityZone': 'NOT_SET', 'AutoScalingGroupName': 'tailor-*' }, 'NetworkIn', 'Average')

may fail if the number of metrics returned exceeds the boto3 CloudWatch metrics page size (currently 500).

ZMON does the filtering itself and only uses the first page.
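One way to avoid the first-page limitation is to fetch all pages with a boto3 paginator and only then apply the wildcard filtering; a sketch under that assumption (function names are illustrative, not the actual wrapper code):

```python
import fnmatch

def matches_dimensions(metric, wanted):
    """True if every wanted dimension matches (wildcards allowed in values)."""
    dims = {d['Name']: d['Value'] for d in metric.get('Dimensions', [])}
    return all(fnmatch.fnmatch(dims.get(name, ''), pattern)
               for name, pattern in wanted.items())

def list_all_metrics(namespace, metric_name):
    """Collect metrics from every page, not just the first one."""
    import boto3  # imported lazily; assumption: boto3 is available
    client = boto3.client('cloudwatch')
    metrics = []
    for page in client.get_paginator('list_metrics').paginate(
            Namespace=namespace, MetricName=metric_name):
        metrics.extend(page['Metrics'])
    return metrics
```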

Get rid of CherryPy configuration file (web.conf)

We are still using the CherryPy configuration file format inside ZMON Worker, but app.py actually just writes environment variable values to it.

Get rid of this legacy dependency and use environment variables (+ args) directly.

Improve logging

Logging is not very helpful right now:

2015-12-22 05:03:06,306 - INFO - zmon_worker_monitor.zmon_worker.tasks.notacelery_task - send_metrics - Send metrics, end storing metrics in redis count: 0, duration: 0.002s
 Dec 22 05:03:06 ip-172-31-163-67 docker/b6eb55fa92b8[840]: 2015-12-22 05:03:06,382 - INFO - zmon_worker_monitor.zmon_worker.tasks.notacelery_task - send_metrics - Send metrics, end storing metrics in redis count: 0, duration: 0.002s
 Dec 22 05:03:07 ip-172-31-136-190 docker/d65d8e292bd8[839]: 2015-12-22 05:03:07,125 - INFO - zmon_worker_monitor.redis_context_manager - __exit__ - IdleLoop: No task received... pid=20, count: 146
 Dec 22 05:03:08 ip-172-31-136-190 docker/d65d8e292bd8[839]: 2015-12-22 05:03:08,031 - INFO - zmon_worker_monitor.redis_context_manager - __exit__ - IdleLoop: No task received... pid=14, count: 13276
 Dec 22 05:03:08 ip-172-31-136-190 docker/d65d8e292bd8[839]: 2015-12-22 05:03:08,032 - INFO - zmon_worker_monitor.redis_context_manager - __exit__ - IdleLoop: No task received... pid=17, count: 12994
 Dec 22 05:03:08 ip-172-31-136-190 docker/d65d8e292bd8[839]: 2015-12-22 05:03:08,031 - INFO - zmon_worker_monitor.redis_context_manager - __exit__ - IdleLoop: No task received... pid=16, count: 13041
  • Remove date/time prefix (already provided by syslog)
  • Reduce number of non-relevant log lines (e.g. idle loop)

Add an option to disable redirects in "http" check command.

For a check defined like

def check():
  status_code = http('https://service.dns.name/file/very-large-file.zip').code()
  return {"status_code": status_code}

I would like to get status_code = 302 when the server returns a redirect. The reason is that I only need to check that very-large-file.zip is accessible; I don't want to download the file in ZMON.

Currently this check returns status_code = 200, which means that ZMON follows redirects. It would be good to be able to disable this behaviour.
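A sketch of how the wrapper could expose this: allow_redirects is the real requests keyword; the follow_redirects flag and function names here are hypothetical, not the actual http() wrapper API.

```python
def fetch_status(url, follow_redirects=True):
    """Return the HTTP status code, optionally without following redirects."""
    import requests  # imported lazily; assumption: requests is available, as in the wrapper
    resp = requests.get(url, allow_redirects=follow_redirects, stream=True, timeout=10)
    return resp.status_code

def is_redirect(status_code):
    """True for HTTP redirect status codes."""
    return status_code in (301, 302, 303, 307, 308)
```

With follow_redirects=False, the check above would see the 302 directly instead of the 200 from the redirect target.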

Resilience to broken downtime entries

Ideally this should be handled gracefully, still triggering the alert; right now it does not get executed or reported at all:

ERROR [worker-35] zmon_worker_monitor.zmon_worker.tasks.main/notify: Notification for check Webapp HTTP Status reached soft time limit
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/zmon_worker-cd156-py2.7.egg/zmon_worker_monitor/zmon_worker/tasks/main.py", line 1445, in notify
    downtimes = self._evaluate_downtimes(alert_id, entity_id)
  File "/usr/local/lib/python2.7/dist-packages/zmon_worker-cd156-py2.7.egg/zmon_worker_monitor/zmon_worker/tasks/main.py", line 1635, in _evaluate_downtimes
    if now > d['start_time'] and now < d['end_time']:
KeyError: 'start_time'
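A defensive sketch of the comparison from the traceback: skip entries missing start_time/end_time instead of letting the KeyError abort the whole notification (the function name is illustrative, not the actual _evaluate_downtimes code):

```python
def is_in_downtime(downtimes, now):
    """True if any well-formed downtime entry covers the given time."""
    for d in downtimes:
        start, end = d.get('start_time'), d.get('end_time')
        if start is None or end is None:
            continue  # broken entry: skip gracefully instead of raising KeyError
        if start < now < end:
            return True
    return False
```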

Cassandra CQL exception with python 2.7.12

The Cassandra wrapper's execute raises an exception with Python 2.7.12 (cassandra-driver 2.7.2):

('Unable to connect to any servers', {'cassandra-node': TypeError('ref() does not take keyword arguments',)})

Support for epochs

There are some APIs that return timestamps as epochs (ZMON's own API is one of them, to give an example). It would be nice to have the time() helper function handle these. A further improvement could be adding datetime.strptime() functionality to make life easier.
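A sketch of what epoch handling could look like (parse_epoch is a hypothetical helper, not part of the existing time() API; the seconds-vs-milliseconds heuristic is an assumption):

```python
from datetime import datetime

def parse_epoch(value):
    """Convert an epoch timestamp (seconds or milliseconds) to a datetime."""
    value = float(value)
    if value > 1e11:  # heuristic: values this large are milliseconds
        value /= 1000.0
    return datetime.utcfromtimestamp(value)
```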
