
oncall's Introduction

Grafana OnCall


Developer-friendly incident response with brilliant Slack integration.

  • Collect and analyze alerts from multiple monitoring systems
  • On-call rotations based on schedules
  • Automatic escalations
  • Phone calls, SMS, Slack, Telegram notifications

Getting Started

We prepared multiple environments. The steps below set up the hobby environment with Docker Compose:

  1. Download docker-compose.yml:

    curl -fsSL https://raw.githubusercontent.com/grafana/oncall/dev/docker-compose.yml -o docker-compose.yml
  2. Set variables:

    echo "DOMAIN=http://localhost:8080
    # Remove 'with_grafana' below if you want to use existing grafana
    # Add 'with_prometheus' below to optionally enable a local prometheus for oncall metrics
    # e.g. COMPOSE_PROFILES=with_grafana,with_prometheus
    COMPOSE_PROFILES=with_grafana
    # to setup an auth token for prometheus exporter metrics:
    # PROMETHEUS_EXPORTER_SECRET=my_random_prometheus_secret
    # also, make sure to enable the /metrics endpoint:
    # FEATURE_PROMETHEUS_EXPORTER_ENABLED=True
    SECRET_KEY=my_random_secret_must_be_more_than_32_characters_long" > .env
  3. (Optional) If you want to enable/setup the prometheus metrics exporter (besides the changes above), create a prometheus.yml file (replacing my_random_prometheus_secret accordingly), next to your docker-compose.yml:

    echo "global:
      scrape_interval:     15s
      evaluation_interval: 15s
    
    scrape_configs:
      - job_name: prometheus
        metrics_path: /metrics/
        authorization:
          credentials: my_random_prometheus_secret
        static_configs:
          - targets: [\"host.docker.internal:8080\"]" > prometheus.yml

    NOTE: you will need to setup a Prometheus datasource using http://prometheus:9090 as the URL in the Grafana UI.

  4. Launch services:

    docker-compose pull && docker-compose up -d
  5. Go to OnCall Plugin Configuration, logging in with the credentials admin/admin (or find the OnCall plugin under Configuration -> Plugins), and connect the OnCall plugin to the OnCall backend:

    OnCall backend URL: http://engine:8080
    
  6. Enjoy! Check our OSS docs if you want to set up Slack, Telegram, Twilio or SMS/calls through Grafana Cloud.
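
The SECRET_KEY placeholder in step 2 has to be swapped for a random value longer than 32 characters. A minimal sketch of one way to generate it, together with the optional PROMETHEUS_EXPORTER_SECRET, using Python's standard `secrets` module. The helper name `make_secret` is hypothetical, and any other source of strong randomness would work just as well:

```python
import secrets


def make_secret(n_bytes: int = 48) -> str:
    """Return a URL-safe random string; 48 random bytes encode to 64 characters."""
    return secrets.token_urlsafe(n_bytes)


if __name__ == "__main__":
    # Paste these lines into .env in place of the placeholder values.
    print(f"SECRET_KEY={make_secret()}")
    print(f"PROMETHEUS_EXPORTER_SECRET={make_secret()}")
```

If the Prometheus secret is used, the same value goes into the credentials field of prometheus.yml.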

Update version

To update your Grafana OnCall hobby environment:

# Update Docker image
docker-compose pull engine

# Re-deploy
docker-compose up -d

After updating the engine, you'll also need to click the "Update" button on the plugin version page. See Grafana docs for more info on updating Grafana plugins.

Join community

Have a question, comment or feedback? Don't be afraid to open an issue!


oncall's Issues

Simple install without Docker or Helm/k8s ?

Hi Folks,

It would be nice to have a simple/plain install option or instructions for Oncall.
The current options all heavily imply existence of Docker/Helm/Kubernetes and this is clearly not the case in all environments.

Thanks

Calculate OnCall schedule quality

Alice, Bob, Sam are on call. I think it would be nice to show some "schedule quality" to help users keep schedules in a good shape.

100% = ((50% * balance) + (50% * working_hours_balance)) * ( 1 / gaps_rate)

We're trying to balance on call hours between engineers on call. Balance = 1 is good. Balance = 0 is bad.

balance = abs (min(Alice, Bob) hours / max(Alice, Bob) hours) * .... all pairs

We're trying to avoid people being on call outside of working hours. At least their on call time outside of working hours should be equal.

working_hours_balance = abs (min(Alice, Bob) hours outside of working hours / max (Alice, Bob) hours outside of working hours) * .... all pairs

We want to avoid gaps in on call schedule. Shift with no users is equal to gap:

gaps_rate = gaps hours / total observed period hours 
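
The proposal above can be sketched in a few lines of Python. This is only an illustration of the idea, not an implementation from the project: the function names and the input shape (a dict of hours per engineer) are hypothetical. One deliberate deviation, labeled in the code: the score is scaled by `(1 - gaps_rate)` rather than `1 / gaps_rate`, because the latter divides by zero for a schedule without gaps.

```python
from itertools import combinations


def pairwise_balance(hours: dict) -> float:
    """Product of min/max ratios over all pairs of engineers (1.0 = perfectly even)."""
    result = 1.0
    for a, b in combinations(hours.values(), 2):
        larger = max(a, b)
        # Two engineers with zero hours each are treated as balanced.
        result *= 1.0 if larger == 0 else min(a, b) / larger
    return result


def schedule_quality(on_call_hours, off_hours, gap_hours, total_hours) -> float:
    """Combine both balances (50% each) and penalize gaps in coverage."""
    balance = pairwise_balance(on_call_hours)
    working_hours_balance = pairwise_balance(off_hours)
    gaps_rate = gap_hours / total_hours
    # Assumption: (1 - gaps_rate) replaces the 1 / gaps_rate factor from the
    # proposal so that a gap-free schedule keeps its full score.
    return (0.5 * balance + 0.5 * working_hours_balance) * (1.0 - gaps_rate)


quality = schedule_quality(
    on_call_hours={"Alice": 20, "Bob": 40},
    off_hours={"Alice": 10, "Bob": 10},
    gap_hours=42,
    total_hours=168,
)
print(f"{quality:.0%}")
```

With more than two engineers the product over all pairs shrinks quickly, so an average or a minimum over pairs may be a better fit; that choice is left open here.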

Can't connect to existing grafana

Hi,
I am running OnCall as a hobby environment via docker-compose (without Grafana) on the same host as an existing Grafana (v9.0.1) that runs as a binary.
Then I tried to configure the OnCall plugin (v1.0.3) and got this error:

Tried connecting to OnCall: http://localhost:8080
Error: Request failed with status code 404, retry or check settings & re-initialize.

Error connecting to onCall - On a kubernetes cluster

Hello everyone,

I have an existing Kubernetes cluster with Grafana already running, and today I installed OnCall using the Helm chart provided in the repository.
The OnCall chart was installed successfully in a different namespace than Grafana.
When I try to configure the OnCall plugin in Grafana I get the following error message, even though everything has been set properly:

Tried connecting to OnCall: http://release-oncall-engine.oncall.svc.cluster.local:8080
Error: Request failed with status code 404, retry or check settings & re-initialize

Thank you

"Tiny" deployment version for those who want a small deployment footprint?

A few HN commenters raised a topic of a large deployment footprint: https://news.ycombinator.com/item?id=31740902

We deliver OnCall with redis + rabbit + mariadb + workers + web server because we focus on reliability a lot. For some users it's not a priority. We could think about some "tiny" deployment environment based on sqlite.

  • Django UWSGI web server -> SQLite
  • Celery workers -> One celery worker for all queues using Celery-SQLAlchemy -> SQLite (https://docs.celeryq.dev/en/v5.2.7/getting-started/backends-and-brokers/index.html#sqlalchemy). It's marked as not stable in docs btw. Also our celery version is 4.3.
  • Redis -> We could go without cache at all or with in-memory Django engine cache. Probably impacted parts:
    • Alert Group page in web.
    • Grouping performance for heavy alert storm from Alertmanager. Typical K8S-generated alert storm.
    • Google Calendar parser relies on caching a lot.
    • Anything else?
  • Rabbit -> Could be removed.
  • MySQL -> Could be removed.

In total we'll need uWSGI + celery. Both could be packed in the one docker container.
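
A rough sketch of what the Django side of such a "tiny" profile might look like. `DATABASES` and `CACHES` are standard Django settings and `sqla+sqlite` is the URL scheme of Celery's SQLAlchemy transport; the file locations, the `CELERY_BROKER_URL` name (which assumes Celery reads Django settings with the `CELERY_` prefix) and whether OnCall's settings module is laid out this way are all assumptions:

```python
from pathlib import PurePosixPath

# Hypothetical directory holding every piece of persistent state.
DATA_DIR = PurePosixPath("/var/lib/oncall")

# SQLite instead of MySQL for the Django ORM.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": str(DATA_DIR / "oncall.sqlite3"),
    }
}

# Per-process in-memory cache instead of Redis. It is not shared between the
# web server and the Celery worker, which affects the features listed above.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
    }
}

# A second SQLite file as the Celery broker instead of RabbitMQ
# (SQLAlchemy transport, marked as not stable in the Celery docs).
CELERY_BROKER_URL = f"sqla+sqlite:///{DATA_DIR / 'celery-broker.sqlite3'}"
```

Because the local-memory cache lives inside each process, anything that relies on the web server and the worker seeing the same cached value would need a different approach in this profile.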

Also we should make users absolutely aware of the performance/reliability limitations if they choose such a deployment.

Any thoughts?

Add postgresql support

Since Django is used at the core of OnCall, it would be great to add the ability to use PostgreSQL as an optional database engine.

For example, some teams prefer PostgreSQL over MySQL or already have production-ready PostgreSQL clusters.

Scheduled maintenance support

For now we can only create a maintenance that starts right away.
As the owner of a service I would like to create a scheduled maintenance that starts at some time in the future.
It would be really great if every scheduled maintenance had some kind of workflow attached to it; even a simple approval step would be greatly appreciated.

Add support for external MySQL instances

Since I've already got a locally running MySQL server, it conflicts with the one that the compose file is trying to set up. It would be great if there was out-of-the-box support for referencing an external MySQL server.

Add raw alert content view

We show raw content of last alert in the template editor (payload example):
Screenshot 2022-06-14 at 16 04 57

Users are missing the ability to check the raw content of specific alerts in the history. Maybe add a button here?
Screenshot 2022-06-14 at 16 05 09

Maintenance page does not respect teams

Description

  1. Maintenance page shows all the maintenances from all teams
  2. If an integration does not belong to the team, its name is not displayed in the table
  3. If an integration does not belong to the team, it's not possible to stop its maintenance

Steps to reproduce

  1. In Grafana OnCall choose Team 1 in the top right corner
  2. Create any integration on the Integrations page
  3. Create a maintenance on Maintenance page
  4. Switch to Team 2
  5. Try to stop maintenance

Screenshot

Screenshot 2022-06-30 at 17 22 27

Prometheus Exporter for OnCall

There is a proposal to make a Prometheus exporter for application monitoring. Monitoring such a critical service is also important if it is deployed locally.

Make explicit which template is rendered where

We have template rendering preview in the integration settings. For now all templates are rendered one under another without titles or annotations. It would be nice to add some title for each template and distinguish them somehow.

Screenshot 2022-06-14 at 15 58 49

Actions support

As an on-call engineer I would like to take actions from a ChatOps-like interface to resolve common issues or launch a predefined set of actions to debug the environment.

Phone calls and SMS don't work on hobby environment

When sending phone notifications we provide a status_callback which is built from settings.BASE_URL. In the hobby deployment BASE_URL points at localhost, so Twilio responds with:
The StatusCallback URL http://localhost:8000/twilioapp/sms_status_events/ is not a valid URL.

We need to check whether the Helm deployment has this issue too @iskhakov
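
A small sketch of a startup check that would flag this kind of misconfiguration early. The helper name `is_public_callback_url` is hypothetical and the rules are a guess at what an external service such as Twilio can reach, not Twilio's actual validation logic:

```python
import ipaddress
from urllib.parse import urlparse


def is_public_callback_url(url: str) -> bool:
    """Heuristic: True if the URL could plausibly be reached from the internet."""
    parsed = urlparse(url)
    host = parsed.hostname
    if parsed.scheme not in ("http", "https") or not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; assume a DNS name that resolves publicly.
        return True
    # Rejects loopback, private and link-local addresses.
    return ip.is_global


print(is_public_callback_url("http://localhost:8000/twilioapp/sms_status_events/"))
print(is_public_callback_url("https://oncall.example.com/twilioapp/sms_status_events/"))
```

A check like this could log a warning when BASE_URL would produce a callback that an external provider cannot call back.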

Add arm64 docker image for grafana/oncall

When running the following command from README.md: docker-compose --env-file .env_hobby -f docker-compose.yml up --build -d, uWSGI fails to start on my M1 Mac.

uWSGI logs:

[uWSGI] getting INI configuration from uwsgi.ini
[log-encoder] registered format ${strftime:%Y-%m-%d %H:%M:%S} ${msgnl}
2022-06-15 08:06:33 *** Starting uWSGI 2.0.20 (64bit) on [Wed Jun 15 08:06:33 2022] ***
2022-06-15 08:06:33 compiled with version: 11.2.1 20220219 on 14 June 2022 18:16:43
2022-06-15 08:06:33 os: Linux-5.10.104-linuxkit #1 SMP PREEMPT Thu Mar 17 17:05:54 UTC 2022
2022-06-15 08:06:33 nodename: cfcc13d1c266
2022-06-15 08:06:33 machine: x86_64
2022-06-15 08:06:33 clock source: unix
2022-06-15 08:06:33 pcre jit disabled
2022-06-15 08:06:33 detected number of CPU cores: 4
2022-06-15 08:06:33 current working directory: /etc/app
2022-06-15 08:06:33 writing pidfile to /tmp/project-master.pid
2022-06-15 08:06:33 detected binary path: /usr/local/bin/uwsgi
2022-06-15 08:06:33 uWSGI running as root, you can use --uid/--gid/--chroot options
2022-06-15 08:06:33 setgid() to 2000
2022-06-15 08:06:33 setuid() to 1000
2022-06-15 08:06:33 chdir() to /etc/app
2022-06-15 08:06:33 your memory page size is 4096 bytes
2022-06-15 08:06:33 detected max file descriptor number: 1048576
2022-06-15 08:06:33 lock engine: pthread robust mutexes
2022-06-15 08:06:33 unable to set PTHREAD_PRIO_INHERIT

When building the grafana/oncall image manually on my machine and using it in the docker-compose file, all works as expected. My proposal is to add a docker image for arm64 to DockerHub, so the instruction works for Apple Silicon/arm64 users.

401 when more than 1 Grafana instance

We are using HA Grafana with a PostgreSQL endpoint and OIDC auth.

But sometimes we get something like this:

image

[grafana-oncall-engine-7d8dddb54c-6fsks oncall] 2022-06-20 06:32:12 source=engine:app google_trace_id=none logger=root inbound latency=0.010145 status=202 method=POST path=/api/internal/v1/plugin/sync content-length=0 slow=0 integration_type=N/A integration_token=N/A
[grafana-oncall-engine-7d8dddb54c-6fsks oncall] 2022-06-20 06:32:12 source=engine:uwsgi status=202 method=POST path=/api/internal/v1/plugin/sync latency=0.010862 google_trace_id=- protocol=HTTP/1.1 resp_size=277 req_body_size=0
[grafana-1 grafana] logger=context traceID=6af451b8f211da18 t=2022-06-20T06:32:12.879886186Z level=error msg="invalid API key" error="invalid API key" traceID=6af451b8f211da18
[grafana-1 grafana] logger=context traceID=6af451b8f211da18 userId=0 orgId=0 uname= t=2022-06-20T06:32:12.879972941Z level=info msg="Request Completed" method=GET path=/api/org/users status=401 remote_addr=127.0.0.6 time_ms=0 duration=129.997µs size=59 referer= traceID=6af451b8f211da18

It seems we need to add "sticky sessions", but we don't fully understand why.

Issue when trying to install with helm

Hi,
I have encountered an issue when trying to install the helm chart on my local kubernetes. I get the following message:

Error: INSTALLATION FAILED: template: oncall/templates/engine/job-migrate.yaml:48:14: executing "oncall/templates/engine/job-migrate.yaml" at <include "snippet.redis.env" .>: error calling include: template: oncall/templates/_env.tpl:157:12: executing "snippet.redis.env" at <include "snippet.redis.host" .>: error calling include: template: oncall/templates/_env.tpl:140:46: executing "snippet.redis.host" at <.Values.externalRedis.host>: nil pointer evaluating interface {}.host

Regards,
Patrick

Display a sample command when creating API token

On the Grafana OnCall->Settings page when you create a new API token it would be nice if it also displayed an example curl command to make use of the token for testing instead of needing to lookup in the docs. This would match the behavior that Grafana has when you create an API key in its settings page.
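
A minimal sketch of how such an example command could be generated. The function name `example_curl` is hypothetical, and both the default endpoint path and the header format (the raw token in an Authorization header) are assumptions that should be checked against the OnCall HTTP API docs:

```python
import shlex


def example_curl(base_url: str, token: str, path: str = "/api/v1/integrations/") -> str:
    """Build a copy-pasteable curl command for trying out a freshly created token."""
    url = base_url.rstrip("/") + path
    # shlex.join quotes each argument so the command is safe to paste into a shell.
    return shlex.join(["curl", "-H", f"Authorization: {token}", url])


print(example_curl("http://localhost:8080", "my-token"))
```

Quoting matters here because the header value contains a space; building the string by hand would be easy to get subtly wrong.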

Slack something went wrong when authorizing this app.

Hello!

Using docker-compose https://raw.githubusercontent.com/grafana/oncall/dev/docker-compose.yml I start OnCall with the command docker-compose --env-file .env_hobby up -d and then configure the Slack integration. The integration works, but when a user wants to connect their Slack account they get this error:

Something went wrong when authorizing this app.
Error details
Invalid client_id parameter

Do I need to do any additional configuration?
To get public url for slack I use localtunnel.

Silence/pause calls or auto-acknowledge for N minutes

It would be nice to have an easily accessible function to pause calls or auto-acknowledge incidents.
During a big crash you get a lot of alerts, and it's hard to acknowledge them and work on the problem at the same time.

Docker compose can not start, error engine

Dear friends,

After following the docs, I cannot start the service. Could you please let me know what steps I'm missing? Below is the error log from the engine. I don't use the built-in Grafana.

[uWSGI] getting INI configuration from uwsgi.ini
[log-encoder] registered format ${strftime:%Y-%m-%d %H:%M:%S} ${msgnl}
2022-06-15 05:32:22 *** Starting uWSGI 2.0.20 (64bit) on [Wed Jun 15 05:32:22 2022] ***
2022-06-15 05:32:22 compiled with version: 11.2.1 20220219 on 14 June 2022 18:16:43
2022-06-15 05:32:22 os: Linux-3.10.0-1160.53.1.el7.x86_64 #1 SMP Fri Jan 14 13:59:45 UTC 2022
2022-06-15 05:32:22 nodename: fe317d18eb64
2022-06-15 05:32:22 machine: x86_64
2022-06-15 05:32:22 clock source: unix
2022-06-15 05:32:22 pcre jit disabled
2022-06-15 05:32:22 detected number of CPU cores: 2
2022-06-15 05:32:22 current working directory: /etc/app
2022-06-15 05:32:22 writing pidfile to /tmp/project-master.pid
2022-06-15 05:32:22 detected binary path: /usr/local/bin/uwsgi
2022-06-15 05:32:22 uWSGI running as root, you can use --uid/--gid/--chroot options
2022-06-15 05:32:22 setgid() to 2000
2022-06-15 05:32:22 setuid() to 1000
2022-06-15 05:32:22 chdir() to /etc/app
2022-06-15 05:32:22 your memory page size is 4096 bytes
2022-06-15 05:32:22 detected max file descriptor number: 1048576
2022-06-15 05:32:22 lock engine: pthread robust mutexes
2022-06-15 05:32:22 thunder lock: disabled (you can enable it with --thunder-lock)
2022-06-15 05:32:22 Listen queue size is greater than the system max net.core.somaxconn (128).

Also I increased net.core.somaxconn to 512

Screen Shot 2022-06-15 at 12 36 59

My system is CentOS 7, with these Docker versions:

[root@k1 grafana-oncall]# docker --version
Docker version 20.10.17, build 100c701

[root@k1 grafana-oncall]# docker-compose --version
Docker Compose version v2.6.0

Thanks so much!

Request failed with status code 404

We have a Grafana instance (v8.5.0) running on an EC2 instance. It is accessed through an internal application load balancer and uses OIDC for authentication.

We have set up OnCall in our EKS cluster and that is accessible through an internet-facing ALB. When attempting to curl from the Grafana instance, I get a response and I can see the associated message in the oncall pod logs. However, when attempting to connect the plugin (v1.0.3), it throws the following error:

Tried connecting to OnCall: https://<snip>
Error: Request failed with status code 404, retry or check settings & re-initialize.

The weird thing is, there are no associated messages in the pod logs for this alleged connection attempt. I thought it might be something to do with the ALB somehow, so I enabled load balancer logs and it doesn't even show any of these alleged connection attempts, which makes me think it isn't even hitting the load balancer for some reason. This is the only message shown in the Grafana logs:

logger=context traceID=00000000000000000000000000000000 userId=2 orgId=1 uname=<snip> t=2022-06-28T13:48:43.56+0000 lvl=info msg="Request Completed" method=POST path=/api/plugin-proxy/grafana-oncall-app/api/internal/v1/plugin/sync status=404 remote_addr=<snip> time_ms=5 duration=5.928139ms size=24 referer="https://<snip>/plugins/grafana-oncall-app?page=configuration" traceID=00000000000000000000000000000000

Not really sure what's going on here. Again, a curl from the instance works, so I know for a fact that from a network perspective it should work, so it just seems like the plugin isn't working for some reason.

Any ideas on how to resolve this? Please let me know if there is more debugging output to help resolve this.

Telegram does not work

I'm trying to connect Telegram but I receive this error:

Telegram, because the incident is not posted to Telegram (reason: Telegram is disabled for the route)

The user was successfully connected to the bot. Is there something else that I need to configure?

Thanks

Create silence for alert until next working hours

Lots of small P5 alerts can wait until the next working hours.
As a schedule owner I would like to set up some kind of working hours or time window to debug and close low-priority alerts. Something like:
notify mon: 9-18, tue: 9-18... fri: 9-18, sat, sun do not notify
with two options: notify if still firing, or notify anyway.

Of course low-priority alerts lead to alert fatigue and should be eliminated, but this is still a popular case.
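
The working-hours rule from this request can be sketched as a small function that either lets a notification through immediately or defers it to the start of the next working window. The name `next_notify_time` and the hard-coded Monday to Friday, 9-18 window are illustrative only, and time zones are ignored:

```python
from datetime import datetime, time, timedelta

WORK_DAYS = {0, 1, 2, 3, 4}  # Monday..Friday
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def next_notify_time(now: datetime) -> datetime:
    """Return `now` if inside working hours, else the start of the next working window."""
    is_work_day = now.weekday() in WORK_DAYS
    if is_work_day and WORK_START <= now.time() < WORK_END:
        return now
    candidate = now
    # Early morning on a working day waits for today's start; anything else
    # rolls forward to the next working day.
    if not (is_work_day and now.time() < WORK_START):
        candidate = now + timedelta(days=1)
        while candidate.weekday() not in WORK_DAYS:
            candidate += timedelta(days=1)
    return datetime.combine(candidate.date(), WORK_START)


# Saturday morning: deferred to Monday 09:00.
print(next_notify_time(datetime(2022, 6, 18, 10, 0)))
```

The "notify if still firing" option would then re-check the alert state at the returned time, while "notify anyway" would send unconditionally.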

Notifications for non-work phone and no cell tower signal

I have recently moved to a basement office, which has zero cell signal, so when I am on call I need to work from upstairs. No big deal. The obvious solution here is to connect Slack on my phone to the account for work, and don't complain about all of the notifications I need to look at.

However, I'm wondering if there could be a future feature that could enable a user such as myself, to receive notifications from an internet location, that is not a work-related account. Work-life balance and such. What do you think of this idea?

[Question] Installer for windows

Hi, I just want to ask whether there is an installer for Grafana OnCall for Windows, such as an .exe?
With Grafana, we can download the .exe and install it on our OS with one click.

Thank you.

Regards,
Henjoe

Issue with user to event association

Hey, first off I'm really excited by this tool especially given the potential for tight integration with the rest of Grafana!

However I'm unfortunately having some issues with the simplest possible set-up, running locally, that would prevent us from adopting this. Apologies if this is the wrong place to ask for help.

Given a schedule like the following:

image

And the following users in Grafana:

image

I would expect to see these users assigned to a schedule. However there seems to be something wrong with the association between event name and Grafana user:

image

I am running the setup from this docker-compose file, and only set it up today so the latest version of the docker image.

Thanks in advance for your help and feel free to direct me somewhere else if GH isn't the place.
