openslo / slogen
Tool to create and manage content for reliability tracking from logs/event data.
License: Apache License 2.0
I have an SLO alert with a 30m short window and a 6h long window, with the same threshold on both.
When the SLO was breached, the alert triggered quickly (within 5m), but it took 6 hours to resolve after the error rate returned to normal.
I would have expected it to resolve quickly, as described in https://sre.google/workbook/alerting-on-slos/
Looking into this a bit deeper, I think the threshold values on the monitor take 6 hours to evaluate, and it might not be possible to implement "Multiwindow, Multi-Burn-Rate Alerts" using Sumo Logic's monitors.
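For reference, the multiwindow condition from the SRE Workbook fires only while BOTH windows exceed the threshold, so resolution should be gated by the short window, not the long one. A minimal sketch of that condition (the numbers are illustrative, and this is the Workbook's rule, not slogen's implementation):

```python
def should_alert(short_burn_rate, long_burn_rate, threshold):
    """Multiwindow, multi-burn-rate condition: fire only while BOTH
    windows are burning error budget faster than the threshold."""
    return short_burn_rate >= threshold and long_burn_rate >= threshold

# During the incident both windows are hot, so the alert fires quickly:
assert should_alert(short_burn_rate=8.0, long_burn_rate=7.0, threshold=6.0)

# Shortly after recovery the 30m window cools off while the 6h window is
# still elevated; the combined condition is false, so the alert should
# resolve without waiting 6 hours:
assert not should_alert(short_burn_rate=0.2, long_burn_rate=7.0, threshold=6.0)
```

If the monitor only evaluates the long window's threshold, the observed 6-hour resolution delay is exactly what you would expect.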
The monitors created don't have useful missing-data alerts.
I'd like to be alerted if I screw up a query and the total count goes to 0 for x minutes.
Hi team, I've been using your tool extensively and I am loving it!
I have come across an issue with the monitoring of an SLO.
My current alerting configuration is as follows:
apiVersion: openslo/v1alpha
kind: SLO
metadata:
  displayName: xxx
  name: xxx
spec:
  service: xxx
  budgetingMethod: Occurrences
  objectives:
    - ratioMetrics:
        total:
          source: sumologic
          queryType: Logs
          query: |
            xxx
        good:
          source: sumologic
          queryType: Logs
          query: 'xxx'
        incremental: true
      displayName: xxx
      target: 0.99
alerts:
  burnRate:
    - shortWindow: '10m'
      shortLimit: 14
      longWindow: '1h'
      longLimit: 14
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning
    - shortWindow: '30m'
      shortLimit: 6
      longWindow: '6h'
      longLimit: 6
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning
    - shortWindow: '6h'
      shortLimit: 1
      longWindow: '24h'
      longLimit: 1
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning
When I evaluate the SLO over a 24h period, it is currently at 98.41897%, which is below the 99% target.
I would have expected to receive at least one email stating that this SLO is not being met; however, none of the generated monitors are triggering.
I'm wondering if the calculation of one of these items may be incorrect?
Current version: there is no slogen command to output the version, but I'm pointing at the latest commit on the main branch.
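As a back-of-the-envelope check (my own arithmetic, not slogen's): a 24h SLI of 98.41897% against a 99% target corresponds to a 24h burn rate of roughly 1.58, which is under the limits of 14 and 6 but over the limit of 1 on the 6h/24h alert, so that monitor at least would be expected to fire if the 6h window is similarly elevated:

```python
target = 0.99
sli_24h = 0.9841897             # observed SLI over 24h, from the report above

error_budget = 1 - target       # 0.01 = allowed error rate
error_rate = 1 - sli_24h        # ~0.0158 = observed error rate
burn_rate_24h = error_rate / error_budget

print(round(burn_rate_24h, 2))  # -> 1.58
```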
For monthly compliance with a 90% SLO: if at the end of the first week there have been 100K requests, is the total error budget 10K request failures, or (30/7) × 100K × (1 − 0.9)?
Maybe we need a new term, "tentative error budget", extrapolated from the traffic seen so far.
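The two interpretations from the question, sketched out with the example numbers (the extrapolation formula is the question's proposal, not something slogen implements):

```python
slo_target = 0.90
period_days = 30
elapsed_days = 7
requests_so_far = 100_000

# Interpretation 1: budget based only on traffic seen so far.
budget_so_far = requests_so_far * (1 - slo_target)        # 10,000 failures

# Interpretation 2: "tentative" budget extrapolated to the full period,
# assuming traffic continues at the same rate.
projected_requests = requests_so_far * (period_days / elapsed_days)
tentative_budget = projected_requests * (1 - slo_target)  # ~42,857 failures

print(round(budget_so_far), round(tentative_budget))  # -> 10000 42857
```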
Workaround for the issue that there is currently no way in the UI to get a connection ID for specifying it in the alerting config.
The purpose is to exclude the data in that duration from the calculation of the SLI, error budget consumed, etc., as well as from burn-rate alerts.
Currently only added to the Global overview dashboards for the top 4 common labels/fields.
The "tf/dashboards/overview-.tf" file changes every time I run "slogen . --apply --clean".
I believe this is because the map is unordered, so the rows are emitted in arbitrary order: https://github.com/SumoLogic-Labs/slogen/blob/main/libs/overview.go#L193
This means that my terraform code changes on each run, rather than being deterministic.
This is low priority, but it would be nice to have these rows sorted.
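Go maps iterate in randomized order, so any template that ranges over a map directly produces a different file on each run; sorting the keys before rendering makes the output stable. The idea, sketched with a hypothetical row-rendering helper (not slogen's actual code):

```python
def render_rows(rows):
    """Render dashboard rows in sorted-key order so repeated runs of the
    generator produce byte-identical terraform output."""
    lines = []
    for name in sorted(rows):  # stable order, independent of map insertion order
        lines.append(f'row "{name}" {{ query = "{rows[name]}" }}')
    return "\n".join(lines)

# The same rows in a different insertion order render identically:
a = render_rows({"latency": "q2", "availability": "q1"})
b = render_rows({"availability": "q1", "latency": "q2"})
assert a == b
```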
Mark the zone below 0% as red, and the 0%-50% zone as orange.
Currently fixed to 30d.
Will allow tracking SLOs over a quarter, a year, etc.
To be done via a separate yaml config or a user-uploaded lookup table.
Top N time slots where the error budget was most affected.
Apps are made of services, and apps might be part of a portfolio. For example, a bank may have online banking as a portfolio consisting of the following hierarchy:
mobile banking app -> consisting of account service and bill pay service
credit card app -> consisting of account service and payment service
In this example, SLOs/SLIs defined at the service level would roll up to the corresponding app.
Prerequisite:
One or more SLOs have been defined for various services
Success Scenario:
Operations user (e.g. a developer) picks which services should be grouped (ideally by consulting the service map)
System validates the following:
the evaluation type is identical for the chosen services (e.g. periodic or aggregate, but not a mix of the two)
the compliance period and type match
If the grouping is valid, SLI/SLO/error budget data are visualised
I've noticed that when I update an SLO, often the dashboards and graphs do not update with the new data for that SLO.
As an example, I'm currently writing latency SLOs as follows:
apiVersion: openslo/v1alpha
kind: SLO
metadata:
  displayName: api Latency
  name: api-ltc
spec:
  service: account-opening
  description: The amount of POST requests to /api that are faster than 2s
  budgetingMethod: Occurrences
  objectives:
    - ratioMetrics:
        total:
          source: sumologic
          queryType: Logs
          query: |
            _sourceCategory=/xxx
            | kv "path", "elapsed_time"
            | where path matches /api/
        good:
          source: sumologic
          queryType: Logs
          query: 'elapsed_time < 2000'
        incremental: true
      displayName: api calls that are faster than 2s
      target: 0.90
Now, if I update the SLO to a new threshold of 1s, the graphs and data that populate the new dashboards are the same data from the original elapsed_time < 2000 query.
I have a feeling this is because the data kept in a scheduled view is not deleted, just disabled.
https://help.sumologic.com/Manage/Scheduled-Views/Pause-or-Disable-Scheduled-Views#disable-a-scheduled-view
Once disabled, no additional data can be indexed in a scheduled view. A disabled scheduled view is not technically deleted, but it can't be re-enabled. If you disable a view and later create a new view with the same name, you won't see duplicate results; instead all the data from both scheduled views are treated as one.
If this is true, I wonder if it's worth hashing the query and appending the hash to the scheduled view name?
Often when I change an SLO, I expect it to stop firing and to re-evaluate the data.
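The hashing idea could look something like this (a sketch with a hypothetical naming scheme; the --useViewHash flag shown in the help output further down this page appears to address the same problem):

```python
import hashlib

def scheduled_view_name(base, query, digest_len=8):
    """Append a short hash of the query so a changed query produces a new
    scheduled view name instead of silently reusing the old view's data."""
    digest = hashlib.sha256(query.encode()).hexdigest()[:digest_len]
    return f"{base}_{digest}"

# Changing the latency threshold yields a distinct view name:
v1 = scheduled_view_name("api_ltc", "elapsed_time < 2000")
v2 = scheduled_view_name("api_ltc", "elapsed_time < 1000")
assert v1 != v2
```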
The current sumologic Email connection is helpful, but I need to use the Webhook connection as well, with the payload_override described in the terraform module.
This would allow me to add the custom template variables seen here: https://github.com/SumoLogic-Labs/slogen/blob/main/libs/templates/terra/monitor.tf.gotf#L57.
From the release page for v1.0.1, I downloaded slogen_1.0.0_Linux_x86_64.tar.gz (strange that the version number is different!) and attempted to run it:
$ ./slogen --help
./slogen: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ./slogen)
./slogen: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./slogen)
./slogen: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by ./slogen)
./slogen: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.35' not found (required by ./slogen)
./slogen: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by ./slogen)
I'm not sure what the problem is here, but the previous release works fine on my system (an HP EliteBook G5 running Ubuntu 18.04.6).
Here's the release from https://github.com/OpenSLO/slogen/releases/tag/v1.0-beta :
$ slogen --help
Generates terraform files from openslo compatible yaml configs.
Generated terraform files can be used to configure SLO monitors, scheduled views & dashboards in sumo.
One or more config or directory containing configs can be given as arg. Doesn't supports regex/wildcards as input.
Usage:
slogen [paths to yaml config]... [flags]
slogen [command]
Examples:
slogen service/search.yaml
slogen ~/team-a/slo/ ~/team-b/slo ~/core/slo/login.yaml
slogen ~/team-a/slo/ -o team-a/tf
Available Commands:
completion Generate the autocompletion script for the specified shell
destroy destroy the content generated from the slogen command, equivalent to 'terraform destroy'
docs generate markdown documents of this tool in the specified path
help Help about any command
list utility command to get additional info about your sumo resources e.g.
new create a sample config from given profile
validate A brief description of your command
Flags:
-o, --out string output directory where to create the terraform files (default "tf")
-d, --dashboardFolder string root dashboard folder where to organise the dashboards per service (default "slogen-tf-dashboards")
-m, --monitorFolder string root monitor folder where to organise the monitors per service (default "slogen-tf-monitors")
-i, --ignoreErrors whether to continue validation even after encountering errors
-p, --plan show plan output after generating the terraform config
-a, --apply apply the generated terraform config as well
-c, --clean clean the old tf files for which openslo config were not found in the path args
--asModule whether to generate the terraform config as a module
--useViewHash whether to use descriptive or hashed name for the scheduled views, hashed names ensure data for old view is not used when the query for it changes
--onlyNative whether to generate only the native slo resources
--sloFolder string root slo folder where to organise native slos by service (default "slogen")
--sloMonitorFolder string root monitor folder where to organise native slo monitors by service (default "slogen")
-h, --help help for slogen
Use "slogen [command] --help" for more information about a command.
And here's my machine info:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
$ uname -a
Linux SunilChop2019-Lunix 4.15.0-194-generic #205-Ubuntu SMP Fri Sep 16 19:49:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
The Timeslices budgeting method uses the ratio of good time slices to total time slices in a budgeting period.
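A toy illustration of the timeslice calculation (slice size and counts are made up):

```python
# Timeslice budgeting: chop the period into fixed slices (e.g. 1 minute)
# and mark each slice good or bad; the SLI is good slices / total slices.
slice_minutes = 1
period_minutes = 24 * 60        # one day = 1440 one-minute slices
bad_slices = 20                 # slices that violated the objective

total_slices = period_minutes // slice_minutes
sli = (total_slices - bad_slices) / total_slices
print(round(sli, 4))  # -> 0.9861
```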
Currently the filter is of the form | where ("{{task}}"="*" or task="{{task}}"); it needs to be | where ("{{task}}"="*" or task matches "{{task}}") instead, so that wildcards are also applied while filtering.
All of the dashboards created by slogen use the "This Month" time range, and since it's the 1st of March, most of them only show one day of data.
I'd like the dashboards to use a rolling last-30-days range instead, so that I can better assess whether the SLO is being met.
E.g. if customerID is specified as a field to aggregate SLI data on, then also allow creating monitors on it, such that alerts can be triggered if the error budget for even a single customer goes below 0%.