
grafana-loki-on-k8s's Introduction

Hi 👋, I'm Said Sef

Amazon AWS Google Cloud Linux Bash Terraform Terraform Cloud Golang NodeJS Python Docker Kaniko Kubernetes ArgoCD HELM Git GitHub GitLab PostgreSql Redis ElasticSearch Prometheus Grafana Splunk Consul Vault Tekton Jenkins GitHub Actions DroneCI scikit-learn Pandas Jupyter OpenAI

saidsef


grafana-loki-on-k8s's People

Contributors

dependabot[bot], saidsef, zizekuros


grafana-loki-on-k8s's Issues

Issue in deploying to GKE

Hi @saidsef, great repo. Unfortunately I get the following errors when deploying on GKE Autopilot and would appreciate some guidance.

Error from server (GKE Warden constraints violations): error when creating "./deployment": admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-node-affinity-selector-limitation]":["Auto GKE disallows use of cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation on workloads"]}
Requested by user: 'cambai-cluster-service-account@camb-ai-82e9a.iam.gserviceaccount.com', groups: 'system:authenticated'.
Error from server (GKE Warden constraints violations): error when creating "./deployment": admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-node-affinity-selector-limitation]":["Auto GKE disallows use of cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation on workloads"]}
Requested by user: 'cambai-cluster-service-account@camb-ai-82e9a.iam.gserviceaccount.com', groups: 'system:authenticated'.
Error from server (GKE Warden constraints violations): error when creating "./deployment": admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-node-affinity-selector-limitation]":["Auto GKE disallows use of cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation on workloads"]}
Requested by user: 'cambai-cluster-service-account@camb-ai-82e9a.iam.gserviceaccount.com', groups: 'system:authenticated'.
Error from server (GKE Warden constraints violations): error when creating "./deployment": admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-node-affinity-selector-limitation]":["Auto GKE disallows use of cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation on workloads"]}
Requested by user: 'cambai-cluster-service-account@camb-ai-82e9a.iam.gserviceaccount.com', groups: 'system:authenticated'.
Error from server (GKE Warden constraints violations): error when creating "./deployment": admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-disallow-privilege]":["container promtail is privileged; not allowed in Autopilot"],"[denied by autogke-no-write-mode-hostpath]":["hostPath volume run in container promtail is accessed in write mode; disallowed in Autopilot.","hostPath volume varlibdockercontainers used in container promtail uses path /var/lib/docker/containers which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume mntpods used in container promtail uses path /mnt/kubelet/pods which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume mntlibdockercontainers used in container promtail uses path /mnt/docker/containers which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/]."],"[denied by autogke-node-affinity-selector-limitation]":["Auto GKE disallows use of cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation on workloads"]}

Thanks
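
The violations above name three things Autopilot rejects: the cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation, a privileged promtail container, and hostPath volumes that are writable or outside /var/log/. A minimal sketch of a Promtail DaemonSet fragment that stays within those limits might look like the following. The labels, image tag, volume names, and the /run/promtail mount path are assumptions for illustration and are not taken from the repository. Swapping the writable "run" hostPath for an emptyDir also means any state stored there is lost when the pod is rescheduled.

```yaml
# Hypothetical fragment; adapt names, image, and paths to the real manifests.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
      # No cluster-autoscaler.kubernetes.io/safe-to-evict annotation here.
    spec:
      containers:
        - name: promtail
          image: grafana/promtail:latest  # assumed tag
          securityContext:
            privileged: false             # privileged containers are rejected
          volumeMounts:
            - name: varlogpods
              mountPath: /var/log/pods
              readOnly: true              # write-mode hostPath is rejected
            - name: run
              mountPath: /run/promtail    # assumed path
      volumes:
        - name: varlogpods
          hostPath:
            path: /var/log/pods           # must start with /var/log/
        - name: run
          emptyDir: {}                    # replaces the writable hostPath volume
```

Whether Promtail can discover every container log from /var/log/pods alone depends on the node's container runtime, so this should be checked against the cluster before relying on it.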

Kubernetes Service Account Token for Mimir and Tempo

Problem Statement:
The Kubernetes workloads for Mimir and Tempo automatically mount a service account token. This is unnecessary, as neither service uses or needs a service account.

Removing the token improves the security posture and reduces the attack surface.

Proposed Solution:
Disable automatic mounting of the service account token via automountServiceAccountToken: false
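
As a minimal sketch, the field can be set on the pod template of the workload. The resource kind, names, labels, and image tag below are assumptions for illustration; the same field applies to a StatefulSet pod template, and to Tempo as well as Mimir.

```yaml
# Hypothetical Deployment; only automountServiceAccountToken is the point here.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mimir
spec:
  selector:
    matchLabels:
      app: mimir
  template:
    metadata:
      labels:
        app: mimir
    spec:
      # Stops the kubelet from mounting the API token into the pod.
      automountServiceAccountToken: false
      containers:
        - name: mimir
          image: grafana/mimir:latest  # assumed tag
```

The field can also be set on the ServiceAccount object itself; the pod-level setting takes precedence when both are present.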

Add CI workflow

It would be good to add CI workflow to validate resources security and deployment state.

Add Grafana mimir

It might be worth adding Grafana Mimir. It is also worth examining how to configure Prometheus to write to Mimir for long-term storage and how to integrate Mimir as a Grafana datasource.
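
A rough sketch of both halves, assuming Mimir is reachable in-cluster as a Service named mimir on its default HTTP port 8080 and runs with multi-tenancy disabled (with multi-tenancy enabled, an X-Scope-OrgID header would also be needed). The Service name and datasource name are hypothetical.

```yaml
# prometheus.yml fragment: ship samples to Mimir's push endpoint.
remote_write:
  - url: http://mimir:8080/api/v1/push  # assumed Service name and port
---
# Grafana datasource provisioning file: query Mimir through its
# Prometheus-compatible API, served under the /prometheus prefix by default.
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus
    access: proxy
    url: http://mimir:8080/prometheus
```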

Upgrade Loki and Promtail to v13 schema

Problem Statement:

Grafana Loki and Promtail have released a new version that uses the v13 schema. This is a breaking change: structured metadata is enabled by default and requires tsdb and the v13 schema, or Loki won’t start.

Proposed Solution:

Upgrade both Loki and Promtail and validate the configuration files, updating them where necessary. Make sure the services start and are able to receive and transmit data as usual.
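
For reference, a Loki schema_config entry that uses tsdb with the v13 schema could look like the sketch below. The start date, object store, and index prefix are assumptions; in an existing installation the new entry is normally appended after the current ones with a from date in the future, so that older data stays readable under its original schema.

```yaml
# Loki config fragment (values other than store and schema are assumptions).
schema_config:
  configs:
    - from: "2024-04-01"        # assumed cut-over date
      store: tsdb               # required for structured metadata
      object_store: filesystem  # assumed backend
      schema: v13               # required for structured metadata
      index:
        prefix: index_
        period: 24h
```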

tempo crashes when disk space is exhausted

Tempo error message:

level=info ts=2023-08-12T20:56:42.949310535Z caller=compactor.go:183 msg="compacting block" block="&{Version:vParquet BlockID:ff30cfa5-d853-4c81-aee7-52081ad6af7f MinID:[0 251 43 51 55 68 170 141 171 74 97 57 122 1 49 204] MaxID:[255 169 151 64 30 229 87 81 95 171 67 125 85 199 80 41] TenantID:single-tenant StartTime:2023-08-12 04:22:03 +0000 UTC EndTime:2023-08-12 04:30:44 +0000 UTC TotalObjects:341 Size:78860 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:1 DataEncoding: BloomShardCount:1 FooterSize:7468}"

Update service versions

There have been releases with bug and security fixes for Grafana, Tempo, Loki, Promtail and Prometheus. It would be great if we could update the minor versions asap.

tempo: disable K8s service account

Grafana Tempo does not require access to the K8s API server, so the automatic mounting of the service account token should be disabled.

This has the added benefit of tightening the security posture of the service.

bug(promql): undefined variable error

Problem Statement:

During Prometheus config load:

{"caller":"manager.go:201","component":"rule manager","err":"/etc/config/alerting_rules.yml: group \"NginxController\", rule 4, \"NGINXSuddenDrop200s\": annotation \"summary\": template: __alert_NGINXSuddenDrop200s:1: undefined variable \"$lables\"","level":"error","msg":"loading groups failed","ts":"2024-07-26T17:59:14.515Z"}
{"caller":"main.go:1380","err":"error loading rules, previous rule set restored","level":"error","msg":"Failed to apply configuration","ts":"2024-07-26T17:59:14.515Z"}

It would appear that the variable requires double quotes.

Proposed Solution:

Update the PromQL rule and wrap the summary in double quotes.
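
Note that the log above reports the undefined variable as "$lables", which looks like a misspelling of Prometheus's built-in $labels template variable, so the spelling may need correcting in addition to the quoting. A sketch of a rule with a quoted summary is shown below. The expression is a placeholder, since the real one is not shown in this issue, and the instance label and summary wording are assumptions.

```yaml
# alerting_rules.yml fragment (expr and summary text are placeholders).
groups:
  - name: NginxController
    rules:
      - alert: NGINXSuddenDrop200s
        expr: up == 0  # placeholder; the real expression is not shown here
        annotations:
          # Quoted so YAML treats the template braces as a plain string;
          # note the spelling $labels, not $lables.
          summary: "Sudden drop in 200 responses on {{ $labels.instance }}"
```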

Add metrics scrape for all services

All of these components have /metrics endpoints; however, not all services have the annotations needed to enable Prometheus scraping.

Please enable metrics scrape for all services!
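
One common way to do this is the prometheus.io/* annotation convention, sketched below on a Service. These annotations only take effect if the Prometheus scrape configuration contains relabelling rules that honour them, which is assumed here; the Service name, selector, and port are also assumptions.

```yaml
# Hypothetical Service; the annotations are the relevant part.
apiVersion: v1
kind: Service
metadata:
  name: loki
  annotations:
    prometheus.io/scrape: "true"   # opt this Service in to scraping
    prometheus.io/port: "3100"     # port that serves /metrics (assumed)
    prometheus.io/path: /metrics
spec:
  selector:
    app: loki
  ports:
    - name: http
      port: 3100
      targetPort: 3100
```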

tempo start error

Tempo start error:

level=error ts=2024-03-01T19:41:37.828456209Z caller=main.go:121 msg="error running Tempo" err="failed to init module services: error initialising module: server: failed to create server: listen tcp: address [[::0]]:3100: missing port in address"

Solution:

Following the upgrade to Tempo v2.4, the overrides in the config have been deprecated and might be causing the error.

  • Update tempo config

[GKE issues] CAdvisor unresponsive

Hey Said,

I got around some of the issues by hacking the YAML, without the greatest understanding of each parameter. I'm still facing issues with grafana-cm.yaml not being suited to GKE: CAdvisor is unresponsive, which prevents me from accessing CPU / mem metrics in the cluster.

It'll be great if I can set up a short call with yourself to get this up and running -- will be a huge help for me and hopefully anybody else who uses this in the future. Thank you so much.

Here is my calendly if it makes things easier -- https://www.calendly.com/akshatp-cs/chat

bug(storage): mimir start error due to storage path

Problem Statement:

Error when Mimir starts:

error validating config: the configured blocks storage filesystem directory "/data/common" cannot overlap with the configured ruler storage filesystem directory "/data/common"; please set different paths, also ensuring one is not a subdirectory of the other one

Proposed solution:

Move one of the two storage paths to a different directory, e.g. /data/storage, so that they no longer overlap.
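
A minimal sketch of that change, assuming it is the blocks storage that moves and the ruler storage stays at /data/common; the issue does not say which of the two was actually moved.

```yaml
# Mimir config fragment: give blocks storage its own directory.
blocks_storage:
  backend: filesystem
  filesystem:
    dir: /data/storage  # must not equal, contain, or sit inside the ruler directory
```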
