thaum-xyz / ankhmorpork

@paulfantom's GitOps managed kube cluster running in a cupboard. Built with fancy tools ✨

Home Page: https://ankhmorpork.thaum.xyz

License: MIT License

Shell 5.29% Jsonnet 88.12% Makefile 1.69% Jinja 1.90% Groovy 2.71% HCL 0.30%

cluster k3s-cluster prometheus prometheus-operator fluxcd kubernetes ansible jsonnet

ankhmorpork's Introduction

Ankhmorpork

📖 Overview

This is a mono repository for @paulfantom's home infrastructure and Kubernetes cluster. The project uses Infrastructure as Code to automate provisioning, operating, and updating self-hosted services.

⛵ Kubernetes

Installation

The cluster runs k3s, provisioned on bare-metal hosts running the latest Ubuntu LTS release, using a modified version of the Ansible role provided by the k3s project.

🔸 Click here to see my Ansible playbooks and roles.

Components

Name	Description
Jsonnet	Data templating language
GitHub Actions	CI system
Ansible	Automate bare metal provisioning and configuration
Ubuntu	Base OS for Kubernetes nodes
K3s	Lightweight distribution of Kubernetes
Kubernetes	Container-orchestration system, the backbone of this project
kured	Kubernetes Reboot Daemon
TopoLVM	Local storage based on LVM
Longhorn	Distributed block storage
Minio	S3 storage
Flux	GitOps tool built to deploy applications to Kubernetes
ExternalSecrets	Secrets and encryption management system
MetalLB	Bare metal load-balancer for Kubernetes
cert-manager	Cloud native certificate management
Cloudflare	DNS
Traefik	Kubernetes Ingress Controller
oauth2-proxy	Authentication proxy
Prometheus	Systems monitoring and alerting toolkit
Thanos	Metrics datalake
Grafana	Operational dashboards
Cloudnative-pg	Postgres Controller
Homer	Portal Site
HomeAssistant	Home Automation System
ESPhome	Microcontrollers Management
Tandoor	Cookbook
Photoprism	Photo Management
Paperless-ngx	Document Management

And many others.

GitOps

Flux watches the manifests/ subdirectories of the base and apps top-level directories and applies changes based on the YAML manifests found there. Where possible, the YAML manifests are generated from jsonnet code.
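As a rough illustration of that directory layout (not of how Flux itself discovers sources, which is driven by its own configuration objects), the Python sketch below lists the manifests/ directories under base and apps. The function name `find_manifest_dirs` and the idea of scanning a local checkout are assumptions made for this example.

```python
from pathlib import Path


def find_manifest_dirs(repo_root, top_levels=("base", "apps")):
    """Return every manifests/ directory found under the given top-level directories."""
    root = Path(repo_root)
    found = []
    for top in top_levels:
        top_dir = root / top
        if not top_dir.is_dir():
            continue  # tolerate a checkout that lacks one of the directories
        # rglob walks the tree recursively; keep only directories named "manifests"
        found.extend(p for p in top_dir.rglob("manifests") if p.is_dir())
    return sorted(found)
```

Calling `find_manifest_dirs(".")` from the root of a checkout would return the paths Flux is expected to reconcile, which can be handy when checking that a newly added app follows the same layout.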

🌐 DNS

Ingress Controller

Over WAN, I have port-forwarded ports 80 and 443 to the load balancer IP of my ingress controller that's running in my Kubernetes cluster.

Internal DNS

CoreDNS is deployed in the cluster. It resolves ingress addresses internally and proxies other queries to NextDNS, which is used for ad blocking.

Dynamic DNS

My home IP can change at any time. To keep my WAN IP address up to date on Cloudflare, I have configured DDNS on the Unifi Dream Machine Pro.

💽 Network Attached Storage

A QNAP TS-431DeU NAS manages the NFS shares and backs them up to B2 cloud storage using HBS.

🔧 Hardware

Device	Count	RAM	Storage	Connectivity	Purpose
Unifi Dream Machine Pro	1	N/A	N/A	8x GbE + 2xSFP+	Router
Unifi US-16-PoE switch	1	N/A	N/A	16x GbE + 2xSFP	Main Switch
QNAP TS-431DeU	1	16GB	2x240GB NVMe RAID1 + 4x3TB RAID5	2x 2.5GbE LACP	NAS
HP EliteDesk G2 800 mini	2	32GB	240GB M2 SSD + 500GB SSD	1x GbE	K3S Node
DELL E5440 Laptop	1	12GB	240GB SSD + 2x 120GB SSD	1x GbE	K3S Node
Custom-built Server	1	64GB	240GB NVMe + 1TB SSD	2x GbE LACP + 1GbE	K3S Node w/GPU

✨ Features

Project status: Alpha

🤝 Contributing

Any contributions you make, either big or small, are greatly appreciated.

🔏 Security

If you find a security issue, please contact me using one of the following channels:

twitter DM (@paulfantom)
kubernetes slack (@paulfantom)
freenode IRC (@paulfantom)
email ([email protected])

🏛️ License

Distributed under the MIT License. See LICENSE for more information.

ankhmorpork's Issues

PAU-45: Cloudflare and letsencrypt certs

Currently, certs for alchemyof.it cannot be requested when cloudflare is configured in strict TLS mode. Investigate adding cloudflare cert issuer to cert-manager or reconfiguring cloudflare rules.

Alert: TargetDown in monitoring

Alert TargetDown firing in monitoring namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2022-12-22 13:39:48.970753442 +0000 UTC m=+327629.299195271.

Common Labels

alertname	TargetDown
cluster	ankhmorpork
job	kubelet
namespace	monitoring
prometheus	monitoring/k8s
severity	warning

Common Annotations

description	25% of the kubelet/ targets in monitoring namespace are down.
runbook_url	https://runbooks.prometheus-operator.dev/runbooks/general/targetdown
summary	One or more targets are unreachable.

Alerts

StartsAt	Links
2022-12-22 13:24:18.612 +0000 UTC	GeneratorURL

(DO NOT MODIFY: c3058ff715a6bb277bc9d4d65713d6bde6d43ea2e64facf2cee7f953c750f083 )
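The percentage in the description above is the share of scrape targets for a job whose `up` metric is 0. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and the target counts used in the note below are assumptions, since the issue only reports the resulting percentage.

```python
def percent_down(up_values):
    """Return the percentage of scrape targets whose `up` sample is 0."""
    if not up_values:
        raise ValueError("no targets to evaluate")
    down = sum(1 for value in up_values if value == 0)
    return 100.0 * down / len(up_values)
```

For example, `percent_down([1, 1, 1, 0])` gives 25.0, which would match the kubelet alert above if one of four kubelet targets were unreachable.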

Setup iSCSI provisioner

In addition to the NFS provisioner, it would be beneficial to set up an iSCSI provisioner - https://github.com/kubernetes-incubator/external-storage/tree/master/iscsi/targetd

The majority of applications using hostPath for volumes could be moved to iSCSI PVCs.

Action items:

Ansible setup for targetd server (on NAS)
Ansible setup for iSCSI initiators (everywhere)
Manifests for iSCSI provisioner
~~2 storage classes - one for vg_fast and one for vg_storage. A former one could be used for testing.~~ Only vg_fast is allowed to be accessible for kubernetes. vg_storage is full and allowed to be accessed only as hostPath for performance reasons.

Use coredns customization provided by k3s

K3s offers a way to customize the CoreDNS instance responsible for cluster DNS. It would be nice to use it for a split-DNS situation.

Lead - k3s-io/k3s#4397

[ALERT] alertname:Test instance:localhost:9090 job:test

(Updated at 2022-12-18 17:00:29.598843205 +0000 UTC m=+588.227325790)

Common Labels

alertname	Test
instance	localhost:9090
job	test

Common Annotations

description	some description
runbook_url	https://runbooks.thaum.xyz

Alerts

thing	hint	StartsAt	Links
value		2022-06-12 01:00:00 +0000 UTC	GeneratorURL

(DO NOT MODIFY: 933b432e797b2d35572313028cb4a685383488c087c4aea733e81833860129d5 )

Automate copying unifi backups

The Unifi controller creates a backup every week. However, this backup is stored locally on the controller itself. It would be beneficial to create a cronjob that copies the backup files and stores them on an NFS-backed PV.
Backup files are stored in /data/autobackup/ on the Unifi controller.

The next step would be to add a backup job that sends data from the PV to Backblaze.
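A minimal sketch of the copy step, assuming the controller's /data/autobackup/ directory and the NFS-backed PV are both mounted into the job's container. The function name and any mount point for the PV are hypothetical. The script copies only files that are not yet present at the destination, so it can run repeatedly from a cronjob.

```python
import shutil
from pathlib import Path


def copy_new_backups(src_dir, dest_dir):
    """Copy backup files missing from dest_dir and return the names copied."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for backup in sorted(src.iterdir()):
        target = dest / backup.name
        # Skip subdirectories and files that an earlier run already copied
        if backup.is_file() and not target.exists():
            shutil.copy2(backup, target)  # copy2 also preserves timestamps
            copied.append(backup.name)
    return copied
```

A cronjob could call `copy_new_backups("/data/autobackup", "/mnt/backup-pv")`, where the second path stands in for wherever the PV ends up mounted.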

Alert: KubeDaemonSetRolloutStuck in monitoring

Alert KubeDaemonSetRolloutStuck firing in monitoring namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2024-02-10 11:32:40.199340918 +0000 UTC m=+44472.207561420.

Common Labels

alertname	KubeDaemonSetRolloutStuck
cluster	ankhmorpork
container	kube-rbac-proxy-main
daemonset	node-exporter
instance	10.42.2.209:8443
job	kube-state-metrics
namespace	monitoring
prometheus	monitoring/k8s
severity	warning

Common Annotations

description	DaemonSet monitoring/node-exporter has not finished or progressed for at least 15 minutes.
runbook_url	https://runbooks.thaum.xyz/runbooks/kubernetes/kubedaemonsetrolloutstuck
summary	DaemonSet rollout is stuck.

Alerts

StartsAt	Links
2024-02-10 06:12:09.821 +0000 UTC	GeneratorURL

Automate provisioning of tools installed manually

Currently, many services on the NAS are deployed manually. They should be provisioned automatically, with their configuration managed via git.

Discovered services:

ddclient
~~iscsi targetd~~ not installed, issue tracked in #2
NFS exports

Configuration done on install (may be possible to automate via kickstart):

~~teamd config for LACP teamed network devices~~ done during system installation
~~mdadm config~~ done during installation
~~LVM config~~ done during installation
~~Static DNS configuration (main DNS server for other clients is running in cluster)~~ done during installation

Alert: KubeDaemonSetMisScheduled in monitoring

Alert KubeDaemonSetMisScheduled firing in monitoring namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2022-12-22 13:39:43.58186847 +0000 UTC m=+327623.910310279.

Common Labels

alertname	KubeDaemonSetMisScheduled
cluster	ankhmorpork
container	kube-rbac-proxy-main
instance	10.42.6.45:8443
job	kube-state-metrics
namespace	monitoring
prometheus	monitoring/k8s
severity	warning

Common Annotations

runbook_url	https://runbooks.thaum.xyz/runbooks/kubernetes/kubedaemonsetmisscheduled
summary	DaemonSet pods are misscheduled.

Alerts

daemonset	description	StartsAt	Links
kured		2022-12-22 13:29:43.046 +0000 UTC	GeneratorURL
speaker		2022-12-22 13:29:43.046 +0000 UTC	GeneratorURL

(DO NOT MODIFY: 7d04718e7ad227b0d6c6089159e34b440211567712f5bb11c4fc412328114d13 )

Alert: KubePodCrashLooping in monitoring

Alert KubePodCrashLooping firing in monitoring namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2024-04-02 20:41:58.294093185 +0000 UTC m=+20.802279370.

Common Labels

alertname	KubePodCrashLooping
cluster	ankhmorpork
container	github-receiver
instance	10.42.6.102:8443
job	kube-state-metrics
namespace	monitoring
pod	github-receiver-668799f6b4-nbspf
prometheus	monitoring/k8s
reason	CrashLoopBackOff
severity	warning
uid	53a4337e-3b09-4ff1-85c3-0e11fcd5d41e

Common Annotations

description	Pod monitoring/github-receiver-668799f6b4-nbspf (github-receiver) is in waiting state (reason: "CrashLoopBackOff").
runbook_url	https://runbooks.thaum.xyz/runbooks/kubernetes/kubepodcrashlooping
summary	Pod is crash looping.

Alerts

StartsAt	Links
2024-04-02 20:36:09.821 +0000 UTC	GeneratorURL

Alert: PostgreSQLHighConnections in paperless

Alert PostgreSQLHighConnections firing in paperless namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2023-04-25 20:19:26.626114387 +0000 UTC m=+82818.323127310.

Common Labels

alertname	PostgreSQLHighConnections
cluster	ankhmorpork
instance	10.42.1.244:9187
namespace	paperless
prometheus	monitoring/k8s
severity	warning

Common Annotations

description	10.42.1.244:9187 is exceeding 80% of the currently configured maximum Postgres connection limit (current value: 22s). Please check utilization graphs and confirm if this is normal service growth, abuse or an otherwise temporary condition or if new resources need to be provisioned (or the limits increased, which is mostly likely).
runbook_url	https://runbooks.thaum.xyz/runbooks/postgresql/postgresqlhighconnections
summary	10.42.1.244:9187 is over 80% of max Postgres connections.

Alerts

StartsAt	Links
2023-04-25 20:13:55.947 +0000 UTC	GeneratorURL

(DO NOT MODIFY: d5f17e654e53aba5a290a7b5bd6083f74ceac73c01297e6d6edaf7b22b53a627 )

Alert: TestAlert in monitoring

Alert TestAlert firing in monitoring namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2022-12-18 18:39:46.271423238 +0000 UTC m=+26.599865047.

Common Labels

alertname	Test1
instance	localhost:9090
job	test

Common Annotations

description	some description
runbook_url	https://runbooks.thaum.xyz

Alerts

thing	hint	StartsAt	Links
value		2022-06-12 01:00:00 +0000 UTC	GeneratorURL

(DO NOT MODIFY: 346585aa11eea0e4b4d76f6e02fbcb6fd3e072e0044f467030c7eaa1440195b6 )

nextcloud: migrate to postgres

Main points in favor of migrating to PostgreSQL:

it is faster (source)
mysql doesn't have good monitoring coverage in prometheus ecosystem (lack of meaningful alerts). Postgresql, on the other hand, is used by gitlab and has already established alerts and runbooks (alerts, recording rules)
possibly no hacks required to run postgresql_exporter

Cons:

there might be issues with some nextcloud apps not working with PostgreSQL as a backend (more in nextcloud/server#5912 (comment) and nextcloud/twofactor_admin#35)

Alert: TargetDown in monitoring

Alert TargetDown firing in monitoring namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2024-04-14 20:39:22.674050872 +0000 UTC m=+214945.203274758.

Common Labels

alertname	TargetDown
cluster	ankhmorpork
job	probe/monitoring/uptimerobot
namespace	monitoring
prometheus	monitoring/k8s
severity	warning

Common Annotations

description	100% of the probe/monitoring/uptimerobot/ targets in monitoring namespace are down.
runbook_url	https://runbooks.prometheus-operator.dev/runbooks/general/targetdown
summary	One or more targets are unreachable.

Alerts

StartsAt	Links
2024-04-14 20:33:22.417 +0000 UTC	GeneratorURL

Alert: PostgreSQLCacheHitRatio in homeassistant

Alert PostgreSQLCacheHitRatio firing in homeassistant namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2023-01-17 13:36:14.31084953 +0000 UTC m=+370286.411825553.

Common Labels

alertname	PostgreSQLCacheHitRatio
cluster	ankhmorpork
datname	homeassistant
namespace	homeassistant
prometheus	monitoring/k8s
severity	warning

Common Annotations

description	PostgreSQL low on cache hit rate on for database homeassistant with a value of 0.5173276395749896
runbook_url	https://runbooks.thaum.xyz/runbooks/postgresql/postgresqlcachehitratio
summary	PostgreSQL low cache hit rate on for database homeassistant

Alerts

StartsAt	Links
2023-01-17 13:30:43.508 +0000 UTC	GeneratorURL

(DO NOT MODIFY: b0cb3dbb66ae7f9be3e3a19466b1c79090cff7825d7e5292013a2f97e4dc16e4 )

Alert: TargetDown in monitoring

Alert TargetDown firing in monitoring namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2023-03-07 15:36:04.518855456 +0000 UTC m=+5316.354817277.

Common Labels

alertname	TargetDown
cluster	ankhmorpork
job	node-exporter
namespace	monitoring
prometheus	monitoring/k8s
severity	warning

Common Annotations

description	20% of the node-exporter/ targets in monitoring namespace are down.
runbook_url	https://runbooks.prometheus-operator.dev/runbooks/general/targetdown
summary	One or more targets are unreachable.

Alerts

StartsAt	Links
2023-03-07 15:34:48.612 +0000 UTC	GeneratorURL

(DO NOT MODIFY: 450f142a6083f4ed63d0d1d7de01ebeb9d33d15312095933d5f927bb4e076c3f )

Run ansible in k3s pod

Due to resource constraints on master01, the ansible deployment cannot finish and causes the k3s apiserver to crash. To mitigate the issue, it would be better to run ansible as a cronjob in k3s.

DoD:

find/create a container image with:
- ansible
- ssh client
- git
cronjob should run ./deploy.sh script
repository should be mounted as PV on NFS storage class
pushgateway shouldn't be exposed to local network anymore
ansible_connection=local cannot be set for any host

Convert docs dir to hosted site

The docs/ directory could be hosted externally under docs.ankhmorpork.thaum.xyz. This can be done by using the pattern from https://github.com/khuedoan/homelab/tree/master/docs

Bring back SiteExternallyDown alert

Reduce severity level of SiteDown alert to warning
Recreate SiteExternallyDown alert that uses only uptimerobot data
Introduce inhibition rule - if SiteExternallyDown is firing, don't send SiteDown
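The inhibition step above can be sketched in Python as a filter over the currently firing alerts. This is a simplified model written for illustration, not Alertmanager's implementation; the function name and the `site` label used in the note below are assumptions. The optional `equal` argument mimics the idea that an inhibition only applies when the listed labels match on both alerts.

```python
def apply_inhibition(firing, source="SiteExternallyDown", target="SiteDown", equal=()):
    """Drop `target` alerts that are inhibited by a firing `source` alert."""
    sources = [a for a in firing if a["alertname"] == source]

    def inhibited(alert):
        if alert["alertname"] != target:
            return False
        # With an empty `equal`, any firing source alert inhibits every target alert
        return any(
            all(alert.get(label) == s.get(label) for label in equal)
            for s in sources
        )

    return [a for a in firing if not inhibited(a)]
```

With `equal=("site",)`, a SiteDown alert is suppressed only when SiteExternallyDown fires for the same site; without it, a single SiteExternallyDown alert silences every SiteDown notification.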

Check out litestream to replicate SQLite

Check whether https://github.com/benbjohnson/litestream can be used to increase the robustness of applications using SQLite underneath

Alert: Test in monitoring

Alertmanager URL: http://localhost:9093

firing

Labels:
- thing = value
Annotations:
- hint = how to fix foobar

TODO: add graph url from annotations.

Replace heimdall with organizr v2

organizr v2 might be a better fit for the main portal site.

DynDNS setup not working

Investigate why UDM Pro did not update DNS entry.

Alert: FoobarIsBroken in <no value>

Alertmanager URL: http://localhost:9093

firing

Labels:
- alertname = Test
- namespace = monitoring
- thing = value
Annotations:
- hint = how to fix foobar

TODO: add graph url from annotations.

Alert: TargetDown in monitoring

Alert TargetDown firing in monitoring namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2023-03-07 14:39:19.103654196 +0000 UTC m=+1910.939616008.

Common Labels

alertname	TargetDown
cluster	ankhmorpork
job	monitoring/smokeping
namespace	monitoring
prometheus	monitoring/k8s
severity	warning

Common Annotations

description	33.33% of the monitoring/smokeping/ targets in monitoring namespace are down.
runbook_url	https://runbooks.prometheus-operator.dev/runbooks/general/targetdown
summary	One or more targets are unreachable.

Alerts

StartsAt	Links
2023-03-07 14:23:48.612 +0000 UTC	GeneratorURL

(DO NOT MODIFY: 0f273fe6719899752f139b038340c40a5684ab5eda17b89da23492695f528c41 )

Change backup strategy

Investigate k8s-native backup solutions to swap the current custom backup solution to a generic one.

Requirements:

uses restic internally
allow sending backups to Backblaze
backups need to be encrypted
allow backing up any PV

Nice to have:

allow backing up /var/lib/rancher/k3s/server on master node to allow master node recovery

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

chore(deps): update base infrastructure (devsec.hardening, k3s-io/k3s, prometheus.prometheus)
chore(deps): update dependency golang to v1.22.2

Detected dependencies

ansible-galaxy

metal/roles/requirements.yml

devsec.hardening 9.0.0

prometheus.prometheus 0.6.0

oefenweb.locales v1.0.52

hifis.unattended_upgrades v3.2.1

github-actions

.github/workflows/kubeconform.yml

actions/checkout v4@b4ffde65f46336ab88eb53be808477a3936bae11

actions/setup-go v5

actions/checkout v4@b4ffde65f46336ab88eb53be808477a3936bae11

actions/setup-go v5

.github/workflows/kubescape.yml

actions/checkout v4@b4ffde65f46336ab88eb53be808477a3936bae11

.github/workflows/prometheusrule.yml

actions/checkout v4@b4ffde65f46336ab88eb53be808477a3936bae11

actions/setup-go v5

prymitive/pint-action v1

.github/workflows/versions.yaml

actions/checkout v4@b4ffde65f46336ab88eb53be808477a3936bae11

actions/setup-go v5

juliangruber/read-file-action v1

peter-evans/create-pull-request v6

helm-values

apps/external-dns/values.yaml

ghcr.io/muhlba91/external-dns-provider-adguard v5.0.0

jsonnet-bundler

apps/datalake-metrics/jsonnet/jsonnetfile.json

apps/monitoring/jsonnet/jsonnetfile.json

apps/parca/jsonnet/jsonnetfile.json

apps/system-update/jsonnet/jsonnetfile.json

base/flux-system/jsonnet/jsonnetfile.json

lib/jsonnet/apps/jsonnetfile.json

regex

metal/group_vars/k3s.yml

k3s-io/k3s v1.28.6+k3s1

.github/workflows/versions.yaml

google/jsonnet v0.20.0

.github/workflows/kubeconform.yml

golang 1.21.5

.github/workflows/prometheusrule.yml

golang 1.21.5

.github/workflows/versions.yaml

golang 1.21.5

Check this box to trigger a request for Renovate to run again on this repository

Alert: PostgreSQLCacheHitRatio in <no value>

Alert PostgreSQLCacheHitRatio firing in namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2022-12-19 02:13:13.971634309 +0000 UTC m=+27234.300076138.

Common Labels

alertname	PostgreSQLCacheHitRatio
cluster	ankhmorpork
datname	homeassistant
prometheus	monitoring/k8s
severity	warning

Common Annotations

description	PostgreSQL low on cache hit rate on for database homeassistant with a value of 0.9593283923506267
runbook_url	https://runbooks.thaum.xyz/runbooks/postgresql/postgresqlcachehitratio
summary	PostgreSQL low cache hit rate on for database homeassistant

Alerts

StartsAt	Links
2022-12-19 02:07:43.508 +0000 UTC	GeneratorURL

(DO NOT MODIFY: 5ef6b1689008f5f35e66c52c08ad18997835c6f7116a07f925d42ce9d44bf43c )

Re-enable redis in nextcloud

The Redis cache was disabled after a recent outage and disaster recovery. It needs to be re-enabled and hardened by adding password protection (REDIS_PASSWORD variable in nextcloud).

Unfortunately, due to how the nextcloud docker container is constructed, this is manual work on the nextcloud side: config.php has to be edited by hand and the correct variables have to be set for the nextcloud container.

Prometheus Probe CRD doesn't probe targets

I am using Prometheus Probe CRD and Blackbox exporter to scrape static targets. But, when I checked in Blackbox exporter, I don't see specified targets being probed at all.

I was able to probe targets using Blackbox exporter and additionalScrapeConfigs in values file of Prometheus exporter but it doesn't work with Probe CRD.

Here is my Probe custom object config,

apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: probe-crd
  namespace: prometheus
spec:
  jobName: probe-crd
  prober:
    url: prometheus-blackbox-exporter:9115
  targets:
    staticConfig:
      static:
      - https://www.google.com

Blackbox exporter service is running on port 9115. Can someone please let me know what I am missing here?

Alert: ThanosQueryGrpcClientErrorRate in datalake-metrics

Alert ThanosQueryGrpcClientErrorRate firing in datalake-metrics namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2024-03-22 02:21:45.539869555 +0000 UTC m=+24.538879833.

Common Labels

alertname	ThanosQueryGrpcClientErrorRate
cluster	ankhmorpork
job	thanos-query
namespace	datalake-metrics
prometheus	monitoring/k8s
severity	warning

Common Annotations

description	Thanos Query thanos-query is failing to send 5.817% of requests.
runbook_url	https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosquerygrpcclienterrorrate
summary	Thanos Query is failing to send requests.

Alerts

StartsAt	Links
2024-03-22 02:15:36.048 +0000 UTC	GeneratorURL

Runbooks and alerts

List of collections of alerts/recording rules and runbooks which might be useful for the cluster:

https://gitlab.com/gitlab-com/runbooks/-/tree/master
~~https://github.com/samber/awesome-prometheus-alerts~~ doesn't have anything more meaningful than what is already sourced from other projects

CI setup

Tools to run in CI:

https://github.com/FairwindsOps/pluto - added in #16
kubeval - added in 011b7b2

Alert: PostgreSQLMaxConnectionsReached in paperless

Alert PostgreSQLMaxConnectionsReached firing in paperless namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2023-04-28 15:13:57.585697289 +0000 UTC m=+22567.696417215.

Common Labels

alertname	PostgreSQLMaxConnectionsReached
cluster	ankhmorpork
instance	10.42.0.74:9187
namespace	paperless
prometheus	monitoring/k8s
severity	warning

Common Annotations

description	10.42.0.74:9187 is exceeding the currently configured maximum Postgres connection limit (current value: 22s). Services may be degraded - please take immediate action (you probably need to increase max_connections in the Docker image and re-deploy.
runbook_url	https://runbooks.thaum.xyz/runbooks/postgresql/postgresqlmaxconnectionsreached
summary	10.42.0.74:9187 has maxed out Postgres connections.

Alerts

StartsAt	Links
2023-04-28 15:03:25.947 +0000 UTC	GeneratorURL

(DO NOT MODIFY: 9af083c4c04aefaa4b687f93fe36767e0219bc14a37a0deb4214a22c8f6ab837 )

Alert: ThanosStoreObjstoreOperationLatencyHigh in datalake-metrics

Alert ThanosStoreObjstoreOperationLatencyHigh firing in datalake-metrics namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2023-12-04 05:27:59.510352973 +0000 UTC m=+495.481901453.

Common Labels

alertname	ThanosStoreObjstoreOperationLatencyHigh
cluster	ankhmorpork
job	thanos-store
namespace	datalake-metrics
prometheus	monitoring/k8s
severity	warning

Common Annotations

description	Thanos Store thanos-store Bucket has a 99th percentile latency of 2.742787470798822 seconds for the bucket operations.
runbook_url	https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreobjstoreoperationlatencyhigh
summary	Thanos Store is having high latency for bucket operations.

Alerts

StartsAt	Links
2023-12-04 05:22:29.197 +0000 UTC	GeneratorURL

Alert: ReconciliationFailure in flux-system

Alert ReconciliationFailure firing in flux-system namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2024-04-15 17:13:33.601854197 +0000 UTC m=+288996.131078121.

Common Labels

alertname	ReconciliationFailure
cluster	ankhmorpork
kind	Kustomization
name	shlink
namespace	flux-system
prometheus	monitoring/k8s
severity	warning

Common Annotations

description	Kustomization flux-system/shlink reconciliation has been failing for more than 10 minutes.
summary	Flux objects reconciliation failure

Alerts

StartsAt	Links
2024-04-11 16:14:32.63 +0000 UTC	GeneratorURL

Migrate nfs-client provisioner

The provisioner version currently in use is outdated and needs to be migrated to https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner

Restrict RBAC permissions for configmapsecrets controller

Seems like permissions for ConfigMapSecret controller are too broad as discussed in https://kubernetes.slack.com/archives/CFFDS2Z7F/p1594757903351300

PAU-59: Enable BGP in metallb

Enable BGP in Unifi USG
Switch metallb to use BGP-mode instead of ARP-mode

Instructions: https://community.ui.com/questions/BGP-instructions-for-USG-K8s-MetalLB/b61e2f67-34f2-4571-9140-8d6b9cde2d72

Monitoring mixins to apply/create

PAU-63: reenable-transmission-exporter

It was disabled in 92b4e47

re-enable adblocker

A recent code update broke DNS, and the adblocker plugin can no longer contact external sources to pull blocklists.

Example errors:

[WARNING] plugin/ads: Loading list from url "https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts" failed with error: Get "https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts": net/http: TLS handshake timeout
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 www.google.de. A: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[WARNING] plugin/ads: Loading list from url "https://mirror1.malwaredomains.com/files/justdomains" failed with error: Get "https://mirror1.malwaredomains.com/files/justdomains": dial tcp 139.146.167.17:443: i/o timeout
[ERROR] plugin/errors: 2 imap.gmail.com. AAAA: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out

Alert: Test in monitoring

Alert Test firing in monitoring namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2022-12-18 18:18:19.43869783 +0000 UTC m=+24.267089034.

Common Labels

alertname	Test1
instance	localhost:9090
job	test

Common Annotations

description	some description
runbook_url	https://runbooks.thaum.xyz

Alerts

thing	hint	StartsAt	Links
value		2022-06-12 01:00:00 +0000 UTC	GeneratorURL

(DO NOT MODIFY: 21a57638781bc59a1dbd9cc17208d8dd0867816312710c3eaf12f759182010ee )

one server type to learn
native prometheus metrics exposition (instead of faulty eko/pihole-exporter) with proper metrics :)
DNS-over-HTTP support
full GitOps management
stateless and easily scalable

Cons:

No webUI to quickly whitelist sites
ads plugin is not included natively in CoreDNS

Alert: KubeDeploymentReplicasMismatch in monitoring

Alert KubeDeploymentReplicasMismatch firing in monitoring namespace

This is an automated issue created by the monitoring system. Please do not edit this message.

Alertmanager URL: https://alertmanager.ankhmorpork.thaum.xyz

Issue was last updated at 2023-12-06 18:30:40.223380231 +0000 UTC m=+1278.197167614.

Common Labels

alertname	KubeDeploymentReplicasMismatch
cluster	ankhmorpork
container	kube-rbac-proxy-main
deployment	grafana
instance	10.42.6.139:8443
job	kube-state-metrics
namespace	monitoring
prometheus	monitoring/k8s
severity	warning

Common Annotations

description	Deployment monitoring/grafana has not matched the expected number of replicas for longer than 15 minutes.
runbook_url	https://runbooks.thaum.xyz/runbooks/kubernetes/kubedeploymentreplicasmismatch
summary	Deployment has not matched the expected number of replicas.

Alerts

StartsAt	Links
2023-12-06 18:25:09.821 +0000 UTC	GeneratorURL

Setup a knowledge base aka wiki

Deploy a knowledge base application like bookstack at wiki.thaum.xyz or wiki.ankhmorpork.thaum.xyz

An alternative would be to use some sort of SaaS offering or anything from https://github.com/awesome-selfhosted/awesome-selfhosted#wikis

Fix issues with mysqld-exporter permissions

time="2020-04-16T09:33:53Z" level=error msg="Error scraping for collect.perf_schema.eventsstatements: Error 1142: SELECT command denied to user 'cloud'@'127.0.0.1' for table 'events_statements_summary_by_digest'" source="exporter.go:171"
time="2020-04-16T09:33:53Z" level=error msg="Error scraping for collect.perf_schema.indexiowaits: Error 1142: SELECT command denied to user 'cloud'@'127.0.0.1' for table 'table_io_waits_summary_by_index_usage'" source="exporter.go:171"
time="2020-04-16T09:33:53Z" level=error msg="Error scraping for collect.perf_schema.tableiowaits: Error 1142: SELECT command denied to user 'cloud'@'127.0.0.1' for table 'table_io_waits_summary_by_table'" source="exporter.go:171"
time="2020-04-16T09:33:53Z" level=error msg="Error scraping for collect.info_schema.innodb_metrics: Error 1227: Access denied; you need (at least one of) the PROCESS privilege(s) for this operation" source="exporter.go:171"
time="2020-04-16T09:33:53Z" level=error msg="Error scraping for collect.info_schema.innodb_cmp: Error 1227: Access denied; you need (at least one of) the PROCESS privilege(s) for this operation" source="exporter.go:171"
time="2020-04-16T09:33:53Z" level=error msg="Error scraping for collect.info_schema.innodb_cmpmem: Error 1227: Access denied; you need (at least one of) the PROCESS privilege(s) for this operation" source="exporter.go:171"
time="2020-04-16T09:33:53Z" level=error msg="Error scraping for collect.slave_status: Error 1227: Access denied; you need (at least one of) the SUPER, REPLICATION CLIENT privilege(s) for this operation" source="exporter.go:171"
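The errors above fall into two groups: global privileges the exporter user lacks (PROCESS, and one of SUPER or REPLICATION CLIENT) and SELECT being denied on specific tables. As a small aid for working out which grants are missing, the Python sketch below extracts both groups from such log lines. The function name and regular expressions are written for this example and are only matched against the message formats shown in this issue.

```python
import re

# "you need (at least one of) the SUPER, REPLICATION CLIENT privilege(s)"
PRIVILEGE_RE = re.compile(r"you need \(at least one of\) the (.+?) privilege\(s\)")
# "SELECT command denied to user 'cloud'@'127.0.0.1' for table 'some_table'"
DENIED_RE = re.compile(r"(\w+) command denied to user '[^']*'@'[^']*' for table '([^']*)'")


def missing_privileges(log_lines):
    """Return (global_privileges, table_privileges) named in the error lines."""
    global_privs, table_privs = set(), set()
    for line in log_lines:
        match = PRIVILEGE_RE.search(line)
        if match:
            # Each tuple lists alternatives: any one of them satisfies the server
            global_privs.add(tuple(p.strip() for p in match.group(1).split(",")))
            continue
        match = DENIED_RE.search(line)
        if match:
            table_privs.add((match.group(1), match.group(2)))
    return global_privs, table_privs
```

Feeding the log lines from this issue into `missing_privileges` yields the PROCESS and SUPER/REPLICATION CLIENT requirements plus the three tables on which SELECT was denied. The mysqld_exporter README suggests granting PROCESS, REPLICATION CLIENT and SELECT to the exporter user, which would likely cover all of them.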