canary-checker's Introduction

Kubernetes Native Health Check Platform


Canary checker is a Kubernetes-native platform for monitoring health across applications and infrastructure using both passive and active (synthetic) mechanisms.

Features

  • Batteries Included - 35+ built-in check types
  • Kubernetes Native - Health checks (or canaries) are CRDs that reflect health via the status field, making them compatible with GitOps, Flux Health Checks, Argo, Helm, etc.
  • Secret Management - Leverage K8S secrets and configmaps for authentication and connection details
  • Prometheus - Prometheus compatible metrics are exposed at /metrics. A Grafana Dashboard is also available.
  • Dependency Free - Runs an embedded Postgres instance by default, and can also be configured to use an external database.
  • JUnit Export (CI/CD) - Export health check results to JUnit format for integration into CI/CD pipelines
  • JUnit Import (k6/newman/puppeteer/etc) - Use any container that creates JUnit test results
  • Scriptable - Go templates, JavaScript and CEL can be used to:
    • Evaluate whether a check is passing and severity to use when failing
    • Extract a user friendly error message
    • Transform and filter check responses into individual check results
    • Extract custom metrics
  • Multi-Modal - While designed as a Kubernetes Operator, canary checker can also run as a CLI and a server without K8s

Getting Started

  1. Install canary checker with Helm
helm repo add flanksource https://flanksource.github.io/charts
helm repo update

helm install \
  canary-checker \
  flanksource/canary-checker \
  -n canary-checker \
  --create-namespace \
  --wait
  2. Create a new check and save it as canary.yaml
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: http-check
spec:
  interval: 30
  http:
    - name: basic-check
      url: https://httpbin.demo.aws.flanksource.com/status/200
    - name: failing-check
      url: https://httpbin.demo.aws.flanksource.com/status/500

2a. Run the check locally (Optional)

wget  https://github.com/flanksource/canary-checker/releases/latest/download/canary-checker_linux_amd64 \
-O canary-checker &&  chmod +x canary-checker
./canary-checker run canary.yaml

  3. Apply the check
kubectl apply -f canary.yaml
  4. Check the health status
kubectl get canary
NAME               INTERVAL   STATUS   LAST CHECK   UPTIME 1H        LATENCY 1H   LAST TRANSITIONED
http-check         30         Passed   13s          18/18 (100.0%)   480ms        13s

See fixtures for more examples and docs for more comprehensive documentation.

Use Cases

Synthetic Testing

Run simple HTTP/DNS/ICMP probes or more advanced full test suites using JMeter, K6, Playwright, or Postman.

# Run a container that executes a playwright test, and then collect the
# JUnit formatted test results from the /tmp folder
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: playwright-junit
spec:
  interval: 120
  junit:
    - testResults: "/tmp/"
      name: playwright-junit
      spec:
        containers:
          - name: playwright
            image: ghcr.io/flanksource/canary-playwright:latest

Infrastructure Testing

Verify that infrastructure is fully operational by deploying new pods, spinning up new EC2 instances and pushing/pulling from docker and helm repositories.

# Schedule a new pod with an ingress and then time how long it takes to schedule, be ready, respond to an http request and finally be cleaned up.
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: pod-check
spec:
  interval: 30
  pod:
    - name: golang
      spec: |
        apiVersion: v1
        kind: Pod
        metadata:
          name: hello-world-golang
          namespace: default
          labels:
            app: hello-world-golang
        spec:
          containers:
            - name: hello
              image: quay.io/toni0/hello-webserver-golang:latest
      port: 8080
      path: /foo/bar
      scheduleTimeout: 20000
      readyTimeout: 10000
      httpTimeout: 7000
      deleteTimeout: 12000
      ingressTimeout: 10000
      deadline: 60000
      httpRetryInterval: 200
      expectedContent: bar
      expectedHttpStatuses: [200, 201, 202]

Backup Checks / Batch File Monitoring

Check that batch file processes are functioning correctly by checking the age and size of files in local file systems, SFTP, SMB, S3 and GCS.

# Checks that a recent DB backup has been uploaded
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: folder-check
spec:
  schedule: 0 22 * * *
  folder:
    - path: s3://database-backups/prod
      name: prod-backup
      maxAge: 1d
      minSize: 10gb
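
The rule this check expresses can be sketched for a local folder in Python. This is only an illustration of the age and size test described above, not canary-checker's implementation: the function name `newest_file_ok` and its parameters are hypothetical, and it assumes the check passes when the most recently modified file is both young enough and large enough.

```python
import os
import tempfile
import time


def newest_file_ok(folder, max_age_seconds, min_size_bytes, now=None):
    """Return (passed, message) for the most recently modified file in folder."""
    now = time.time() if now is None else now
    with os.scandir(folder) as entries:
        files = [e for e in entries if e.is_file()]
    if not files:
        return False, "no files found"
    # Only the newest file is evaluated (an assumption of this sketch).
    newest = max(files, key=lambda e: e.stat().st_mtime)
    info = newest.stat()
    age = now - info.st_mtime
    if age > max_age_seconds:
        return False, f"{newest.name} is too old ({age:.0f}s)"
    if info.st_size < min_size_bytes:
        return False, f"{newest.name} is too small ({info.st_size} bytes)"
    return True, f"{newest.name} is fresh and large enough"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        with open(os.path.join(folder, "backup.sql"), "wb") as f:
            f.write(b"x" * 2048)
        # maxAge of one day, minSize of 1 KiB
        print(newest_file_ok(folder, max_age_seconds=86400, min_size_bytes=1024))
```

Passing `now` explicitly makes the age rule easy to exercise without waiting for files to age.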

Alert Aggregation

Aggregate alerts and recommendations from Prometheus, AWS CloudWatch, Dynatrace, etc.

apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: alertmanager-check
spec:
  schedule: "*/5 * * * *"
  alertmanager:
    - url: alertmanager.monitoring.svc
      alerts:
        - .*
      ignore:
        - KubeScheduler.*
        - Watchdog
      transform:
        # for each alert, transform it into a new check
        javascript: |
          var out = _.map(results, function(r) {
            return {
              name: r.name,
              labels: r.labels,
              icon: 'alert',
              message: r.message,
              description: r.message,
            }
          })
          JSON.stringify(out);

Prometheus Exporter Replacement

Export custom metrics from the result of any check, making it possible to replace various other Prometheus exporters that collect metrics via HTTP, SQL, etc.

apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: exchange-rates
spec:
  schedule: "every 1 @hour"
  http:
    - name: exchange-rates
      url: https://api.frankfurter.app/latest?from=USD&to=GBP,EUR,ILS
      metrics:
        - name: exchange_rate
          type: gauge
          value: result.json.rates.GBP
          labels:
            - name: "from"
              value: "USD"
            - name: to
              value: GBP
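
For readers unfamiliar with what a gauge looks like once scraped, the following Python sketch renders one sample in the Prometheus text exposition format. The helper `render_gauge` is hypothetical, the value `0.5` is a placeholder rather than a real exchange rate, and the exact metric name and label set that canary-checker emits at /metrics may differ from this illustration.

```python
def render_gauge(name, value, labels):
    """Render one gauge sample in the Prometheus text exposition format."""

    def escape(v):
        # Label values escape backslash, double quote and newline.
        return str(v).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    label_text = ",".join(f'{k}="{escape(v)}"' for k, v in sorted(labels.items()))
    return f"# TYPE {name} gauge\n{name}{{{label_text}}} {value}\n"


if __name__ == "__main__":
    # Placeholder value standing in for result.json.rates.GBP
    print(render_gauge("exchange_rate", 0.5, {"from": "USD", "to": "GBP"}), end="")
```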

Platform Ready

Canary checker is ideal for building platforms: developers can include health checks for their applications in whatever tooling they prefer, with secret management that uses native Kubernetes constructs.

apiVersion: v1
kind: Secret
metadata:
  name: basic-auth
stringData:
  user: john
  pass: doe
---
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: http-basic-auth-configmap
spec:
  http:
    - url: https://httpbin.demo.aws.flanksource.com/basic-auth/john/doe
      username:
        valueFrom:
          secretKeyRef:
            name: basic-auth
            key: user
      password:
        valueFrom:
          secretKeyRef:
            name: basic-auth
            key: pass
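
The Secret above uses `stringData`, which Kubernetes stores base64-encoded under `data`. As a small Python sketch of that encoding (the helper name `to_secret_data` is hypothetical and only mirrors what the API server does on write):

```python
import base64


def to_secret_data(string_data):
    """Encode plain-text stringData values the way they appear under data."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in string_data.items()
    }


if __name__ == "__main__":
    print(to_secret_data({"user": "john", "pass": "doe"}))
    # → {'user': 'am9obg==', 'pass': 'ZG9l'}
```

Base64 is an encoding, not encryption, so the usual access controls on Secrets still matter.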

Dashboard

Canary checker comes with a built-in dashboard by default.

A Grafana dashboard is also available, or you can build your own using the exposed metrics.

Getting Help

If you have any questions about canary checker:

Your feedback is always welcome!

Check Types

| Protocol | Status | Checks |
| --- | --- | --- |
| HTTP(s) | GA | Response body, headers and duration |
| DNS | GA | Response and duration |
| Ping/ICMP | GA | Duration and packet loss |
| TCP | GA | Port is open and connectable |
| Data Sources | | |
| SQL (MySQL, Postgres, SQL Server) | GA | Ability to login, results, duration, health exposed via stored procedures |
| LDAP | GA | Ability to login, response time |
| ElasticSearch / Opensearch | GA | Ability to login, response time, size of search results |
| Mongo | Beta | Ability to login, results, duration |
| Redis | GA | Ability to login, results, duration |
| Prometheus | GA | Ability to login, results, duration |
| Alerts | | Prometheus |
| Prometheus Alert Manager | GA | Pending and firing alerts |
| AWS Cloudwatch Alarms | GA | Pending and firing alarms |
| Dynatrace Problems | Beta | Problems detected |
| DevOps | | |
| Git | GA | Query Git and Github repositories via SQL |
| Azure Devops | Beta | |
| Integration Testing | | |
| JMeter | Beta | Runs and checks the result of a JMeter test |
| JUnit / BYO | Beta | Run a pod that saves JUnit test results |
| K6 | Beta | Runs K6 tests that export JUnit via a container |
| Newman | Beta | Runs Newman / Postman tests that export JUnit via a container |
| Playwright | Beta | Runs Playwright tests that export JUnit via a container |
| File Systems / Batch | | |
| Local Disk / NFS | GA | Check folders for files that are: too few/many, too old/new, too small/large |
| S3 | GA | Check contents of AWS S3 Buckets |
| GCS | GA | Check contents of Google Cloud Storage Buckets |
| SFTP | GA | Check contents of folders over SFTP |
| SMB / CIFS | GA | Check contents of folders over SMB/CIFS |
| Config | | |
| AWS Config | GA | Query AWS config using SQL |
| AWS Config Rule | GA | AWS Config Rules that are firing, Custom AWS Config queries |
| Config DB | GA | Custom config queries for Mission Control Config DB |
| Kubernetes Resources | GA | Kubernetes resources that are missing or are in a non-ready state |
| Backups | | |
| GCP Databases | GA | Backup freshness |
| Restic | Beta | Backup freshness and integrity |
| Infrastructure | | |
| EC2 | GA | Ability to launch new EC2 instances |
| Kubernetes Ingress | GA | Ability to schedule and then route traffic via an ingress to a pod |
| Docker/Containerd | Deprecated | Ability to push and pull containers via docker/containerd |
| Helm | Deprecated | Ability to push and pull helm charts |
| S3 Protocol | GA | Ability to read/write/list objects on an S3 compatible object store |

Contributing

See CONTRIBUTING.md

Thank you to all our contributors!

License

Canary Checker core (the code in this repository) is licensed under Apache 2.0 and accepts contributions via GitHub pull requests after signing a CLA.

The UI (Dashboard) is free to use with canary checker under a license exception of Flanksource UI.

canary-checker's People

Contributors

adityathebe, ajatprabha, brendangalloway, ciju, dabasvibhor, dependabot[bot], flankbot, gjagnoor, ikropotov, joeshiett, junaid-ebrahim, kaitou786, moshloop, msergg, paddatrapper, parth-gohil, philipstaffordwood, prashanth-nelli, rouxblouw, rubenharutyunov, sainivikas, step-security-bot, talboren, teodor-pripoae, vallard, wwwlde, yashmehrotra

canary-checker's Issues

OPA Policy

Run OPA policy against all objects defined in a namespace
Input:

namespace: *
objects:
- deployments
- pods
- service
opa:
- fromBundle: http://path/to/opa/bundle.zip
- fromConfigMap: policies
- fromValue: {...}

Metrics:
Passing / Failing objects

S3 Bucket Scan

A canary that can be used to check that backups are being made and that they are non-empty

Input:

bucket:
accessKey:
secretKey:
endpoint:
objectPath: # glob path to restrict matches to a subset
readWrite: false
maxAge: # maximum allowed age of matched objects
minSize: # min size of the most recent matched object

Output: Pass/Fail/Invalid

Metrics:

  • Total Size
  • Object Count
  • Last write time

Tagging

Each check should allow specifying a tag e.g. tag: critical which then gets pushed through to prometheus.

LDAP lookup / auth

Input

host: ldaps://hostname:636
username: admin
password: admin
bindDn: 
userSearch: 

Output: Pass/Fail/Invalid

Metrics:

  • Lookup time
  • Record count (min,max,avg)

Namespace check

Similar to the pod check, but first create a new namespace, and then delete it afterwards.

ICMP Check

Input:

# If DNS, test connectivity for each returned record
endpoints: []
# max response time
thresholdMillis: 300
# max percentage packet loss
packetLossThreshold: 0.01

Metrics (per endpoint)

  • latency
  • packetLoss
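
How the two thresholds could combine into a pass/fail result is sketched below in Python. The function `evaluate_ping` is hypothetical, and the sketch assumes that `packetLossThreshold` is a fraction (0.01 meaning 1%) and that the slowest reply is compared against `thresholdMillis`; the issue does not settle either point.

```python
def evaluate_ping(latencies_ms, sent, threshold_millis=300, packet_loss_threshold=0.01):
    """Evaluate one endpoint.

    latencies_ms: round-trip times of the replies that were received.
    sent: number of probes sent.
    """
    if sent <= 0:
        raise ValueError("sent must be positive")
    packet_loss = 1 - len(latencies_ms) / sent
    # Worst-case latency; None when every probe was lost.
    latency = max(latencies_ms) if latencies_ms else None
    passed = (
        latency is not None
        and latency <= threshold_millis
        and packet_loss <= packet_loss_threshold
    )
    return {"passed": passed, "latency": latency, "packetLoss": packet_loss}


if __name__ == "__main__":
    print(evaluate_ping([12, 15, 11], sent=3))
```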

Deadline refactoring

Also noticed some unused things here
https://github.com/flanksource/canary-checker/blob/master/checks/namespace.go#L58

checks/namespace.go:58
deadline := time.Now().Add(time.Duration(config.Interval) * time.Second)
this deadline is calculated, but is not used in the Check()
same for pod.go
https://github.com/flanksource/canary-checker/blob/master/checks/pod.go#L75
checks/pod.go:75
deadline := time.Now().Add(time.Duration(config.Interval) * time.Second)

Moshe Immerman
Yeh - it needs to be normalized
the deadline is used here: https://github.com/flanksource/canary-checker/blob/master/checks/namespace.go#L170
checks/namespace.go:170
deadline := time.Now().Add(time.Duration(check.Deadline) * time.Millisecond)

although that should probably be a httpTimeout
deadlines should probably be based on the interval, and implemented at a higher level
including context cancellation

Node selector

Choose where to run the checks from - can be a single node or multiple
If multiple - all checks must pass, and a prometheus metric is exported per node

TCP Check

Connect and then close a connection to a port
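
A minimal Python sketch of such a probe, using only the standard library. The function name `tcp_check` and the default timeout are assumptions for illustration, not the project's API.

```python
import socket


def tcp_check(host, port, timeout_seconds=5.0):
    """Return True if a TCP connection to host:port can be opened and closed."""
    try:
        # The context manager closes the connection as soon as it is established.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    print(tcp_check("127.0.0.1", 22, timeout_seconds=1.0))
```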

Fix Tooltip overlap

Also need to look at the tooltips - Currently they overlap, and they should probably update a single window rather than create and fade

Pod Check - Reuse pod name

Reuse pod names (maybe in a rolling fashion) to prevent too many metrics series from being created which can bloat prometheus
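
One way to read "rolling" is a fixed pool of names that is cycled through, so the number of distinct label values stays bounded. The Python sketch below assumes that interpretation; `rolling_pod_names` and the pool size are hypothetical.

```python
import itertools


def rolling_pod_names(prefix, pool_size):
    """Yield prefix-0 .. prefix-(pool_size - 1) repeatedly, in order."""
    if pool_size < 1:
        # Raised when the generator is first advanced.
        raise ValueError("pool_size must be at least 1")
    for index in itertools.cycle(range(pool_size)):
        yield f"{prefix}-{index}"


if __name__ == "__main__":
    names = rolling_pod_names("hello-world-golang", 3)
    print(list(itertools.islice(names, 5)))
```

With a pool of three names, at most three metric series per pod label are ever created, however long the check runs.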

PersistentVolume

Test that a persistent volume of specified class & size can be:

  • Created
  • Folders created
  • Files created, updated, read and deleted
  • Folders Deleted
  • Deleted and cleaned up

HTTP Check

Input:

# If DNS, test connectivity for each returned record
endpoints: 
   - https://google.com
   - http://google.com:443
# max response time
thresholdMillis: 300
responseCodes: [201,200,301]
responseContent: "OK"
maxSSLExpiry: 60d

Metrics (per endpoint)

  • response_time
  • response_code
  • ssl_certificate_expiry

DNS Lookup

Input:

  • DNS Server host and port
  • DNS Query
  • DNS Query Type (default to A)
  • Min number of records to return
  • Exact reply match
  • Timeout

Output: Pass/Fail/Invalid

Metrics:

  • Lookup time
  • Record count (min,max,avg)

S3 Bucket test

Input:

bucket:
accessKey:
secretKey:
endpoint:
objectPath:
readWrite: false

Output: Pass/Fail/Invalid

Metrics:

  • Lookup time
  • List / Update / Read

Create CRD & Controller

apiVersion: canary.flanksource.com/v1
kind: CanaryCheck
spec:
    interval: 60s
    secretRefs:
        - secrets
    config: #json.rawMessage

The controller should update the status for each CRD after each check and fire events for failures, to enable kubectl get canary

We can either deploy a new canary-checker instance per CRD and aggregate results or run all the canaries from the controller.

UI - Improve check reporting graphics

Currently, the UI just displays a Red or Green square with no visual indicator of age or duration.

Possible designs:

Now  ---------------      5m Ago
✓.    ✗ ✓     ✗              ✗
5h Ago ---------------    2d Ago
    ✓.    ✗ ✓     ✗               ✓   

Could use red/green circles and then the size of the circle to indicate the duration

Docker Image Pull

Test that a docker image can be pulled

Input:

image: docker.io/busybox
username: 
password:
expectedDigest:  abcdef123
expectedSize: 200m

Metrics (per endpoint)

  • pull_time
  • layers
  • size

PostgreSQL SQL execute

Input

connection: user=pqgotest dbname=pqgotest sslmode=verify-full
query:  SELECT 1
results: 1

Output: Pass/Fail/Invalid

Metrics:

  • response_time
  • Record count (min,max,avg)

Prometheus Export

/metrics

Export a snapshot of metrics created by all configured checks

POD Schedule & Serve

Given the pod spec, create a Pod, service and ingress, schedule it and then confirm that it serves results correctly via the ingress

# Full pod spec
spec:
   ....
thresholdMillis: 300
expectedContent: OK

Metrics:

  • schedule_time (from Pending to ContainerCreating)
  • creation_time (from ContainerCreating to Running)
  • destroy_time (from Terminating to Terminated)
  • ingress_time (from Running to Ingress check succeeds)
  • serve_time (Response time via ingress)
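
The first three metrics are differences between phase transition times. As a rough Python sketch, assuming the transition timestamps have already been collected into a dict (the function `phase_durations` and the dict shape are hypothetical; ingress_time and serve_time are left out because they depend on HTTP probes rather than pod phases):

```python
from datetime import datetime, timedelta

# metric name -> (phase where the timer starts, phase where it stops)
PHASE_PAIRS = {
    "schedule_time": ("Pending", "ContainerCreating"),
    "creation_time": ("ContainerCreating", "Running"),
    "destroy_time": ("Terminating", "Terminated"),
}


def phase_durations(transitions):
    """Return durations in milliseconds for every pair with both timestamps.

    transitions: dict mapping a phase name to the datetime it was entered.
    """
    durations = {}
    for metric, (start, end) in PHASE_PAIRS.items():
        if start in transitions and end in transitions:
            delta = transitions[end] - transitions[start]
            durations[metric] = delta.total_seconds() * 1000
    return durations


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0, 0)
    print(phase_durations({
        "Pending": t0,
        "ContainerCreating": t0 + timedelta(seconds=1.5),
        "Running": t0 + timedelta(seconds=4),
    }))
```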

Intermittent nil pointer

[PASS] <VALID> [icmp] 172.24.130.12 duration=2 [] Succesffully checked
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x1f1d2aa]

goroutine 61 [running]:
github.com/flanksource/canary-checker/cmd.(*State).AddCheck(0x3f6afd0, 0xc0000a9620)
	/app/cmd/serve.go:221 +0x9a
github.com/flanksource/canary-checker/cmd.glob..func3.2(0xc00040a240, 0x2b2de40, 0x41209d0)
	/app/cmd/serve.go:68 +0x54
created by github.com/flanksource/canary-checker/cmd.glob..func3
	/app/cmd/serve.go:66 +0x680

Status Page

Create a self-reloading status page that displays a summary from all configured checks.
