openslo / slogen
Tool to create and manage content for reliability tracking from logs/event data.
License: Apache License 2.0
I have an SLO alert with a 30m short window and a 6h long window, with the same threshold on both.
When the SLO was breached, the alert triggered quickly (within 5m), but it took 6 hours to resolve after the error rate returned to normal.
I would have expected it to resolve quickly, as described in https://sre.google/workbook/alerting-on-slos/
Looking into this a bit deeper, I think the threshold values on the monitor take 6 hours to evaluate, and it might not be possible to implement "Multiwindow, Multi-Burn-Rate Alerts" using Sumo Logic's monitors.
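For reference, the multiwindow condition from the SRE Workbook fires only while BOTH windows exceed the threshold, so resolution should be gated by the short window, not the long one. A minimal sketch of that condition (the numbers are illustrative, and this is the Workbook's rule, not slogen's implementation):

```python
def should_alert(short_burn_rate, long_burn_rate, threshold):
    """Multiwindow, multi-burn-rate condition: fire only while BOTH
    windows are burning error budget faster than the threshold."""
    return short_burn_rate >= threshold and long_burn_rate >= threshold

# During the incident both windows are hot, so the alert fires quickly:
assert should_alert(short_burn_rate=8.0, long_burn_rate=7.0, threshold=6.0)

# Shortly after recovery the 30m window cools off while the 6h window is
# still elevated; the combined condition is false, so the alert should
# resolve without waiting 6 hours:
assert not should_alert(short_burn_rate=0.2, long_burn_rate=7.0, threshold=6.0)
```

If the monitor only evaluates the long window's threshold, the observed 6-hour resolution delay is exactly what you would expect.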
The monitors created don't have useful missing-data alerts.
I'd like to be alerted if I screw up a query and the total count goes to 0 for x minutes.
Hi team, I've been using your tool extensively and I am loving it!
I have come across an issue with the monitoring of an SLO.
My current alerting configuration is as follows:
apiVersion: openslo/v1alpha
kind: SLO
metadata:
  displayName: xxx
  name: xxx
spec:
  service: xxx
  budgetingMethod: Occurrences
  objectives:
    - ratioMetrics:
        total:
          source: sumologic
          queryType: Logs
          query: |
            xxx
        good:
          source: sumologic
          queryType: Logs
          query: 'xxx'
        incremental: true
      displayName: xxx
      target: 0.99
alerts:
  burnRate:
    - shortWindow: '10m'
      shortLimit: 14
      longWindow: '1h'
      longLimit: 14
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning
    - shortWindow: '30m'
      shortLimit: 6
      longWindow: '6h'
      longLimit: 6
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning
    - shortWindow: '6h'
      shortLimit: 1
      longWindow: '24h'
      longLimit: 1
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning
When I evaluate the SLO over a 24h period, it is currently at 98.41897%, which is below the 99% target.
I would have expected to receive at least one email stating that this SLO is not being met; however, none of the generated monitors are triggering.
I'm wondering if the calculation of one of these items may be incorrect?
Current version: there is no slogen command to output the version, but I'm pointing at the latest commit on the main branch.
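As a back-of-the-envelope check (my own arithmetic, not slogen's): a 24h SLI of 98.41897% against a 99% target corresponds to a 24h burn rate of roughly 1.58, which is under the limits of 14 and 6 but over the limit of 1 on the 6h/24h alert, so that monitor at least would be expected to fire if the 6h window is similarly elevated:

```python
target = 0.99
sli_24h = 0.9841897             # observed SLI over 24h, from the report above

error_budget = 1 - target       # 0.01 = allowed error rate
error_rate = 1 - sli_24h        # ~0.0158 = observed error rate
burn_rate_24h = error_rate / error_budget

print(round(burn_rate_24h, 2))  # -> 1.58
```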
For monthly compliance with a 90% SLO: if at the end of the first week there have been 100K requests, is the total error budget 10K request failures, or (30/7) × 100K × (1 − 0.9)?
Maybe we need a new term, "tentative error budget", extrapolated from the traffic seen so far.
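The two interpretations from the question, sketched out with the example numbers (the extrapolation formula is the question's proposal, not something slogen implements):

```python
slo_target = 0.90
period_days = 30
elapsed_days = 7
requests_so_far = 100_000

# Interpretation 1: budget based only on traffic seen so far.
budget_so_far = requests_so_far * (1 - slo_target)        # 10,000 failures

# Interpretation 2: "tentative" budget extrapolated to the full period,
# assuming traffic continues at the same rate.
projected_requests = requests_so_far * (period_days / elapsed_days)
tentative_budget = projected_requests * (1 - slo_target)  # ~42,857 failures

print(round(budget_so_far), round(tentative_budget))  # -> 10000 42857
```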
Workaround for the issue that there is currently no way in the UI to get a connection ID for specifying it in the alerting config.
The purpose is to exclude the data in that duration from the calculation of the SLI, error budget consumed, etc., as well as from burn-rate alerts.
Currently only added to the Global overview dashboards for the top 4 common labels/fields.
The "tf/dashboards/overview-.tf" file changes every time I run "slogen . --apply --clean".
I believe this is because the map is unordered, so the rows are emitted in arbitrary order: https://github.com/SumoLogic-Labs/slogen/blob/main/libs/overview.go#L193
This means that my terraform code changes on each run, rather than being deterministic.
This is low priority, but it would be nice to have these rows sorted.
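Go maps iterate in randomized order, so any template that ranges over a map directly produces a different file on each run; sorting the keys before rendering makes the output stable. The idea, sketched with a hypothetical row-rendering helper (not slogen's actual code):

```python
def render_rows(rows):
    """Render dashboard rows in sorted-key order so repeated runs of the
    generator produce byte-identical terraform output."""
    lines = []
    for name in sorted(rows):  # stable order, independent of map insertion order
        lines.append(f'row "{name}" {{ query = "{rows[name]}" }}')
    return "\n".join(lines)

# The same rows in a different insertion order render identically:
a = render_rows({"latency": "q2", "availability": "q1"})
b = render_rows({"availability": "q1", "latency": "q2"})
assert a == b
```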
Mark the zone below 0% as red, and the 0%-50% zone as orange.
Currently fixed to 30d.
Will allow tracking SLOs over a quarter, a year, etc.
To be done via a separate yaml config or a user-uploaded lookup table.
Top N time slots where the error budget was most affected.
Apps are made of services, and apps might be part of a portfolio. For example, a bank may have online banking as a portfolio consisting of the following hierarchy:
mobile banking app -> consisting of account service and bill pay service
credit card app -> consisting of account service and payment service
In this example, SLOs/SLIs defined at the service level would roll up to the corresponding app.
Prerequisite:
One or more SLOs have been defined for various services
Success Scenario:
Operations user (e.g. a developer) picks which services should be grouped (ideally by consulting the service map)
System validates the following:
the evaluation type is identical for the chosen services (e.g. periodic or aggregate, but not a mix of the two)
the compliance period and type match
If the grouping is valid, SLI/SLO/error budget data are visualised
I've noticed that when I update an SLO, often the dashboards and graphs do not update with the new data for that SLO.
As an example, I'm currently writing latency SLOs as follows:
apiVersion: openslo/v1alpha
kind: SLO
metadata:
  displayName: api Latency
  name: api-ltc
spec:
  service: account-opening
  description: The amount of POST requests to /api that are faster than 2s
  budgetingMethod: Occurrences
  objectives:
    - ratioMetrics:
        total:
          source: sumologic
          queryType: Logs
          query: |
            _sourceCategory=/xxx
            | kv "path", "elapsed_time"
            | where path matches /api/
        good:
          source: sumologic
          queryType: Logs
          query: 'elapsed_time < 2000'
        incremental: true
      displayName: api calls that are faster than 2s
      target: 0.90
Now, if I update the SLO to a new threshold of 1s, the graphs and data that populate the new dashboards are the same data from the original elapsed_time < 2000 query.
I have a feeling this is because the data kept in a scheduled view is not deleted, just disabled.
https://help.sumologic.com/Manage/Scheduled-Views/Pause-or-Disable-Scheduled-Views#disable-a-scheduled-view
Once disabled, no additional data can be indexed in a scheduled view. A disabled scheduled view is not technically deleted, but it can't be re-enabled. If you disable a view and later create a new view with the same name, you won't see duplicate results; instead all the data from both scheduled views are treated as one.
If this is true, I wonder if it's worth hashing the query and appending the hash to the scheduled view name?
Often when I change an SLO, I expect it to stop firing and to re-evaluate the data.
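The hashing idea could look something like this (a sketch with a hypothetical naming scheme; the --useViewHash flag shown in the help output further down this page appears to address the same problem):

```python
import hashlib

def scheduled_view_name(base, query, digest_len=8):
    """Append a short hash of the query so a changed query produces a new
    scheduled view name instead of silently reusing the old view's data."""
    digest = hashlib.sha256(query.encode()).hexdigest()[:digest_len]
    return f"{base}_{digest}"

# Changing the latency threshold yields a distinct view name:
v1 = scheduled_view_name("api_ltc", "elapsed_time < 2000")
v2 = scheduled_view_name("api_ltc", "elapsed_time < 1000")
assert v1 != v2
```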
The current sumologic Email connection is helpful, but I need to use the Webhook connection as well, with the payload_override described in the terraform module.
This would allow me to add the custom template variables seen here: https://github.com/SumoLogic-Labs/slogen/blob/main/libs/templates/terra/monitor.tf.gotf#L57.
From the release page for v1.0.1, I downloaded slogen_1.0.0_Linux_x86_64.tar.gz (strange that the version number is different!) and attempted to run it:
$ ./slogen --help
./slogen: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ./slogen)
./slogen: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./slogen)
./slogen: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by ./slogen)
./slogen: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.35' not found (required by ./slogen)
./slogen: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by ./slogen)
I'm not sure what the problem is here, but the previous release works fine on my system (an HP EliteBook G5 running Ubuntu 18.04.6).
Here's the release from https://github.com/OpenSLO/slogen/releases/tag/v1.0-beta :
$ slogen --help
Generates terraform files from openslo compatible yaml configs.
Generated terraform files can be used to configure SLO monitors, scheduled views & dashboards in sumo.
One or more config or directory containing configs can be given as arg. Doesn't supports regex/wildcards as input.
Usage:
slogen [paths to yaml config]... [flags]
slogen [command]
Examples:
slogen service/search.yaml
slogen ~/team-a/slo/ ~/team-b/slo ~/core/slo/login.yaml
slogen ~/team-a/slo/ -o team-a/tf
Available Commands:
completion Generate the autocompletion script for the specified shell
destroy destroy the content generated from the slogen command, equivalent to 'terraform destroy'
docs generate markdown documents of this tool in the specified path
help Help about any command
list utility command to get additional info about your sumo resources e.g.
new create a sample config from given profile
validate A brief description of your command
Flags:
-o, --out string output directory where to create the terraform files (default "tf")
-d, --dashboardFolder string root dashboard folder where to organise the dashboards per service (default "slogen-tf-dashboards")
-m, --monitorFolder string root monitor folder where to organise the monitors per service (default "slogen-tf-monitors")
-i, --ignoreErrors whether to continue validation even after encountering errors
-p, --plan show plan output after generating the terraform config
-a, --apply apply the generated terraform config as well
-c, --clean clean the old tf files for which openslo config were not found in the path args
--asModule whether to generate the terraform config as a module
--useViewHash whether to use descriptive or hashed name for the scheduled views, hashed names ensure data for old view is not used when the query for it changes
--onlyNative whether to generate only the native slo resources
--sloFolder string root slo folder where to organise native slos by service (default "slogen")
--sloMonitorFolder string root monitor folder where to organise native slo monitors by service (default "slogen")
-h, --help help for slogen
Use "slogen [command] --help" for more information about a command.
And here's my machine info:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
$ uname -a
Linux SunilChop2019-Lunix 4.15.0-194-generic #205-Ubuntu SMP Fri Sep 16 19:49:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
The Timeslices budgeting method uses the ratio of good time slices to total time slices in a budgeting period.
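A toy illustration of the timeslice calculation (slice size and counts are made up):

```python
# Timeslice budgeting: chop the period into fixed slices (e.g. 1 minute)
# and mark each slice good or bad; the SLI is good slices / total slices.
slice_minutes = 1
period_minutes = 24 * 60        # one day = 1440 one-minute slices
bad_slices = 20                 # slices that violated the objective

total_slices = period_minutes // slice_minutes
sli = (total_slices - bad_slices) / total_slices
print(round(sli, 4))  # -> 0.9861
```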
Currently the filter is of the form | where ("{{task}}"="*" or task="{{task}}"); it needs to be | where ("{{task}}"="*" or task matches "{{task}}") instead, so that wildcards are also applied while filtering.
All of the dashboards created by slogen use the "This Month" time range, and since it's the 1st of March, most of them only show one day of data.
I'd like the dashboards to use a rolling last-30-days range instead, so that I can better assess whether the SLO is being met.
E.g. if customerID is specified as a field to aggregate SLI data on, then also allow creating monitors on it, such that alerts can be triggered if the error budget for even a single customer goes below 0%.