opentelemetry-collector-monitoring's Introduction

OpenTelemetry (OTEL) collector monitoring

Metrics

The Collector can expose its own Prometheus metrics on port 8888 at the path /metrics, bound to a local interface. In containerized environments it may be desirable to bind this port to a public interface instead, so that the metrics are reachable from outside the container.

service:
  telemetry:
    metrics:
      address: 127.0.0.1:8888
      level: detailed   
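
For a containerized Collector, a minimal sketch of the same setting bound to all interfaces might look like the following. It assumes a Collector version that accepts the address key shown above, and that access to the port is restricted by other means (for example a network policy), since 0.0.0.0 makes the endpoint reachable from outside the container.

```yaml
service:
  telemetry:
    metrics:
      # Assumption: bind to all interfaces so the endpoint is reachable from outside the container
      address: 0.0.0.0:8888
      level: detailed
```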

The Collector can scrape its own metrics through its own metrics pipeline, so a real configuration can look like this:

extensions:
  sigv4auth/aws:

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: otel-collector-metrics
        scrape_interval: 10s
        static_configs:
          - targets: ['127.0.0.1:8888']

exporters:
  prometheusremotewrite/aws:
    endpoint: ${PROMETHEUS_ENDPOINT}
    auth:
      authenticator: sigv4auth/aws
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 10s
      max_elapsed_time: 30s

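service:
  # Extensions must be enabled here, otherwise the authenticator referenced by the exporter is not available
  extensions: [sigv4auth/aws]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: []
      # Must match the exporter name defined in the exporters section
      exporters: [prometheusremotewrite/aws]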
  telemetry:
    metrics:
      address: 127.0.0.1:8888
      level: detailed
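
As an alternative to the self-scrape pipeline, an external Prometheus server could scrape the Collector directly, provided the telemetry port is reachable from that server. The following prometheus.yml fragment is a sketch only; the target host name otel-collector is a hypothetical placeholder.

```yaml
scrape_configs:
  - job_name: otel-collector-metrics
    scrape_interval: 10s
    # /metrics is the default metrics_path, shown here for clarity
    metrics_path: /metrics
    static_configs:
      # Hypothetical host name; replace with the address where port 8888 is exposed
      - targets: ['otel-collector:8888']
```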

Grafana dashboard for OpenTelemetry collector metrics

OpenTelemetry collector dashboard

Prometheus alerts

Recommended Prometheus alerts for OpenTelemetry collector metrics:

groups:
  - name: opentelemetry-collector
    rules:
      - alert: processor-dropped-spans
        expr: sum(rate(otelcol_processor_dropped_spans{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some spans have been dropped by a processor
          description: The collector may have received non-standard spans or reached some limits
      - alert: processor-dropped-metrics
        expr: sum(rate(otelcol_processor_dropped_metric_points{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some metric points have been dropped by a processor
          description: The collector may have received non-standard metric points or reached some limits
      - alert: receiver-refused-spans
        expr: sum(rate(otelcol_receiver_refused_spans{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some spans have been refused by a receiver
          description: The collector may have received non-standard spans or reached some limits
      - alert: receiver-refused-metrics
        expr: sum(rate(otelcol_receiver_refused_metric_points{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some metric points have been refused by a receiver
          description: The collector may have received non-standard metric points or reached some limits
      - alert: exporter-enqueued-spans
        expr: sum(rate(otelcol_exporter_enqueue_failed_spans{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some spans could not be enqueued by an exporter
          description: The destination may have a problem or the payload may be incorrect
      - alert: exporter-enqueued-metrics
        expr: sum(rate(otelcol_exporter_enqueue_failed_metric_points{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some metric points could not be enqueued by an exporter
          description: The destination may have a problem or the payload may be incorrect
      - alert: exporter-failed-requests
        expr: sum(rate(otelcol_exporter_send_failed_requests{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some exporter requests failed
          description: The destination may have a problem or the payload may be incorrect
      - alert: high-cpu-usage
        expr: max(rate(otelcol_process_cpu_seconds{}[1m])*100) > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High max CPU usage
          description: The collector needs to be scaled up
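
If the Collector also runs log pipelines, the same pattern might be extended to log records. The sketch below assumes that the Collector version in use exposes the metrics otelcol_receiver_refused_log_records, otelcol_processor_dropped_log_records and otelcol_exporter_enqueue_failed_log_records; metric names and suffixes vary between Collector versions, so check the /metrics output first. The group and alert names are illustrative.

```yaml
groups:
  - name: opentelemetry-collector-logs
    rules:
      # Same shape as the span and metric point alerts, applied to log records
      - alert: receiver-refused-logs
        expr: sum(rate(otelcol_receiver_refused_log_records{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some log records have been refused by a receiver
          description: The collector may have received non-standard log records or reached some limits
      - alert: processor-dropped-logs
        expr: sum(rate(otelcol_processor_dropped_log_records{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some log records have been dropped by a processor
          description: The collector may have received non-standard log records or reached some limits
      - alert: exporter-enqueue-failed-logs
        expr: sum(rate(otelcol_exporter_enqueue_failed_log_records{}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Some log records could not be enqueued by an exporter
          description: The destination may have a problem or the payload may be incorrect
```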

Documentation

opentelemetry-collector-monitoring's People

Contributors

jangaraj, kevinnoel-be, kibaamor, lindeskar, rowleyaj

opentelemetry-collector-monitoring's Issues

processor label missing in otel metrics

Hi! Thanks for the great dashboard!

I noticed one issue and I'm not sure what the reason for it is.
The dashboard relies on the processor label in some queries, but I can't actually see this label in any of the OTel collector metrics. I'm running the latest version of the collector with the otel/opentelemetry-collector-contrib:0.97.0 image.

Does this label come from some older versions or am I missing something?

I have the detailed metrics level enabled:

    service:
      telemetry:
        metrics:
          level: detailed

Here is the result of my search for the processor label:

[image: search results for the processor label]

I think it's easy to work around, I just want to clarify the reason first.

Thanks!

Contributing

Hi @jangaraj,

I utilise your otel dashboard in Grafana, great work thank you - some of the panels I didn't even know about before using your dashboard!

I wanted to ask if you accept contributions to the dashboard?

During setup I noticed something that doesn't incorporate the job variable. I'd also like to add an additional variable that we use locally, but I'm not sure whether Grafana has ever come up with a way to do this cleanly while still incorporating changes. Do you know?

Thanks, Alex
