Collection fails for Kafka using release 1.0.0. Investigation is underway.

Background As part of the work in <code class="notr

Thanks a lot <a class="user-mention notranslate" data-hovercard-type="user" data-hover

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Possibly... a couple of options... A <p dir="au

Collection fails for Kafka using release 1.0.0,about prometheus/jmx_exporter

Comments (14)

dhoard commented on July 4, 2024

Background

As part of the work in client_java 1.x.x to integrate with OpenMetrics/OpenTelemetry metric names are enforced.

Kafka appears to create metrics with suffixes that violate the OpenMetrics/OpenTelemetry specification.

Stacktrace

An Exception occurred while scraping metrics: java.lang.IllegalArgumentException: 'kafka_server_socketservermetrics_failed_handshake_total': Illegal metric name. The metric name must not include the '_total' suffix. Call PrometheusNaming.sanitizeMetricName(name) to avoid this error.
	at io.prometheus.metrics.model.snapshots.MetricMetadata.validate(MetricMetadata.java:106)
	at io.prometheus.metrics.model.snapshots.MetricMetadata.<init>(MetricMetadata.java:63)
	at io.prometheus.metrics.model.snapshots.MetricSnapshot$Builder.buildMetadata(MetricSnapshot.java:80)
	at io.prometheus.metrics.model.snapshots.GaugeSnapshot$Builder.build(GaugeSnapshot.java:129)
	at io.prometheus.jmx.JmxCollector.collect(JmxCollector.java:832)
	at io.prometheus.metrics.model.registry.MultiCollector.collect(MultiCollector.java:26)
	at io.prometheus.metrics.model.registry.PrometheusRegistry.scrape(PrometheusRegistry.java:72)
	at io.prometheus.metrics.exporter.common.PrometheusScrapeHandler.scrape(PrometheusScrapeHandler.java:112)
	at io.prometheus.metrics.exporter.common.PrometheusScrapeHandler.handleRequest(PrometheusScrapeHandler.java:53)
	at io.prometheus.metrics.exporter.httpserver.MetricsHandler.handle(MetricsHandler.java:43)
	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:95)
	at jdk.httpserver/sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:82)
	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:98)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:851)
	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:95)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:818)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

Sanitizing the name could result in duplicate metrics, violating metric exposition constraints:

sanitized(foo_total) = foo
sanitized (foo) = foo

from jmx_exporter.

dhoard commented on July 4, 2024

Debugging shows two similar MBeans registered with different names.

MBean 1 has a dash (-) between topics and partitions...

v3.topics-partitions

MBean 2 has a period (.) between topics and partitions

v3.topics.partitions-

These both will get sanitized to the same name with the dashes (-) and the periods (.) replaced with underscores (_), resulting in duplicate metrics.

 254201 [FINE] io.prometheus.jmx.JmxScraper kafka.rest{type=jersey-metrics}v3.topics-partitions-reassignment.list.response-rate scrape: 0.0
 254202 [FINE] io.prometheus.jmx.JmxCollector add metric sample: kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate [] [] 0.0
 254203 [FINE] io.prometheus.jmx.JmxCollector matchedRule.name [kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate] sanitizedName [kafka_rest_jersey_metrics_v3_to        pics_partitions_reassignment_list_response_rate]

 254573 [FINE] io.prometheus.jmx.JmxScraper kafka.rest{type=jersey-metrics}v3.topics.partitions-reassignment.list.response-rate scrape: 0.0
 254574 [FINE] io.prometheus.jmx.JmxCollector add metric sample: kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate [] [] 0.0
 254575 [FINE] io.prometheus.jmx.JmxCollector matchedRule.name [kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate] sanitizedName [kafka_rest_jersey_metrics_v3_to        pics_partitions_reassignment_list_response_rate]

from jmx_exporter.

dhoard commented on July 4, 2024

Debugging shows that...

jmx_exporter/collector/src/main/java/io/prometheus/jmx/JmxCollector.java

Line 402 in 6725b2f

static String safeName(String name) {

... converts . and - (unsafe characters per the Prometheus metrics specification) to _, which results in duplicate metric names for two different MBeans.

In these scenarios, the first MBean metric would be used. Subsequent MBeans metrics (with the same name) would be ignored.

from jmx_exporter.

fstab commented on July 4, 2024

Thanks a lot @dhoard for analyzing this!!!

I guess adding a label like mbean_name in case there is a name conflict is the best way to solve this. Something like this:

v3_topics_partitions{mbean_name="v3.topics-partitions"} ...
v3_topics_partitions{mbean_name="v3.topics.partitions"} ...

If we add that label only in case of name collisions it should have minimal impact.

However, this will beak if the metrics have different types, so we should set the type UNKNOWN for all metrics with name collisions.

I'm wondering what the previous versions of jmx_exporter did. My guess is metrics were just overwriting each other, so it was basically random which value you got.

from jmx_exporter.

dhoard commented on July 4, 2024

While the example uses UNNOWN metric types (because of Kafka), the issue can occur for UNTYPES (treated as UNKNOWN), COUNTER and GAUGE metrics.

I'm wondering what the previous versions of jmx_exporter did. My guess is metrics were just overwriting each other, so it was basically random which value you got.

@fstab Correct. The previous jmx_exporter versions would use the first MBean, dropping the metrics for the second MBean, etc. if the sanitized name matched, resulting in missed metrics for some MBeans.

~~When an MBean has tabular data with no keys, tracking the MBean name alone isn't enough to guarantee uniqueness.~~ - JUnit test issue

from jmx_exporter.

dhoard commented on July 4, 2024

@fstab I have code working on my branch (https://github.com/dhoard/jmx_exporter/tree/issue-958) by adding an _x_id_ label to every non-JVM metric.

I have gone down this path because...

the output wouldn't provide any information on which MBean was used to generate the first metric....

# HELP kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate The average number of HTTP responses per second. kafka.rest:name=null,type=jersey-metrics,attribute=v3.topics-partitions-reassignment.list.response-rate
# TYPE kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate untyped
kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate{} 0.0
kafka_rest_jersey_metrics_v3_topics_partitions_reassignment_list_response_rate{_x_id_="kafka.rest:type=jersey-metrics,attribute=v3.topics.partitions-reassignment.list.response-rate"} 0.0

it would require keeping a Map of previously encountered metric names and performing a lookup when processing every MBean attribute value to check for a duplicate. (XSnapshot.Builder classes have no computeIfAbsent method to check for a duplicate datapoint.)

from jmx_exporter.

dhoard commented on July 4, 2024

My branch has been updated to include a label _x_id_ on all non-JVM metrics that contains a Murmur3 hash of ObjectName + ,atttribute= + attribute name.

Analysis

The response size for a Kafka server with 86,415 metrics and the following rules...

rules:
- pattern: ".*"

... increases by ~140KB (11.5% increase)

The transfer time for 140KB on a 1Gbps network is ~1.1 milliseconds.
Calculating a Murmur3 hash is performant and the probability of a hash collision is practically 0.

P ≈ 1− (e-1/2147483648)

(e-1/2147483648) ≈ 1

P ≈ 1 - 1

P ≈ 0

from jmx_exporter.

fstab commented on July 4, 2024

Please let me know when your branch is ready to review!

from jmx_exporter.

fstab commented on July 4, 2024

Can we add the extra label only to the metrics that have a name collision, rather than to all metrics? I think name collisions are rare, so most metrics won't need that label.

from jmx_exporter.

dhoard commented on July 4, 2024

Possibly... a couple of options...

Change the client_java ...Snaphot.Builder classes to throw a DuplicateLabelsException when adding a ...DataPointSnapshot value with colliding labels. This breaks the builder pattern.

The JMX exporter code would need to keep maps of the ...DataPointSnapshot.Builder objects independently of the ...Snapshot.Builder objects.

When building the MetricSnapshots... all of the logic to detect duplicates/adding an additional label for uniqueness would have to be implemented in the JMX Exporter code. (Collect everything in isolation... merge in the end.)

This is further complicated because none of the builder classes implement hashCode() / equals(Object other) so they can't be used as Map keys.

Thoughts?

from jmx_exporter.

fstab commented on July 4, 2024

I am confused. Is the problem conflicting metric names, or conflicting labels?

from jmx_exporter.

dhoard commented on July 4, 2024

Let me try to clarify...

Test MBean

package io.prometheus.jmx;

import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public interface CollidingMBean {

    int getXid();
}

class Colliding implements CollidingMBean {

    private final int value;

    public Colliding(int value) {
        this.value = value;
    }

    @Override
    public int getXid() {
        return value;
    }

    static void registerBean(MBeanServer mbs) throws JMException {
        ObjectName objectName =
                new ObjectName("io.prometheus.jmx.test:type=Colliding-Bean");
        Colliding mBean = new Colliding(123);
        mbs.registerMBean(mBean, objectName);

        objectName = new ObjectName("io.prometheus.jmx.test:type=Colliding.Bean");
        mBean = new Colliding(321);
        mbs.registerMBean(mBean, objectName);

        objectName = new ObjectName("io.prometheus.jmx.test:type=Colliding_Bean");
        mBean = new Colliding(246);
        mbs.registerMBean(mBean, objectName);

        objectName = new ObjectName("io.prometheus.jmx.test:type=Colliding|Bean");
        mBean = new Colliding(246);
        mbs.registerMBean(mBean, objectName);
    }
}

Test Scenario

io.prometheus.jmx.test:type=Colliding-Bean
io.prometheus.jmx.test:type=Colliding.Bean
io.prometheus.jmx.test:type=Colliding_Bean
io.prometheus.jmx.test:type=Colliding|Bean

Exporter code

During scraping, the exporter "sanitizes" the name to match the Prometheus/OpenMetrics metrics name specification and adds the attribute.

sanitized(io.prometheus.jmx.test:type=Colliding-Bean) = io_prometheus_jmx_test_Colliding_Bean_attributeName
sanitized(io_prometheus_jmx_test_Colliding_Bean) = io_prometheus_jmx_test_Colliding_Bean_attributeName
sanitized(io_prometheus_jmx_test_Colliding_Bean) = io_prometheus_jmx_test_Colliding_Bean_attributeName
sanitized(io_prometheus_jmx_test_Colliding_Bean) = io_prometheus_jmx_test_Colliding_Bean_attributeName

Because the "sanitized" name is the same, these independent MBean are merged into a single metric with multiple datapoint snapshots, each having the same labels.

switch (matchedRule.type) {
                case "COUNTER":
                    {
                        CounterSnapshot.Builder counterBuilder =
                                countersMap.computeIfAbsent(
                                        sanitizedName,
                                        key ->
                                                CounterSnapshot.builder()
                                                        .name(sanitizedName)
                                                        .help(finalMatchedRule.help));

                        counterBuilder.dataPoint(
                                CounterSnapshot.CounterDataPointSnapshot.builder()
                                        .labels(labels)
                                        .value(value.doubleValue())
                                        .build());

                        break;
                    }
                case "GAUGE":
                    {
                        GaugeSnapshot.Builder gaugeBuilder =
                                gaugeMap.computeIfAbsent(
                                        sanitizedName,
                                        key ->
                                                GaugeSnapshot.builder()
                                                        .name(sanitizedName)
                                                        .help(finalMatchedRule.help));

                        gaugeBuilder.dataPoint(
                                GaugeSnapshot.GaugeDataPointSnapshot.builder()
                                        .labels(labels)
                                        .value(value.doubleValue())
                                        .build());

                        break;
                    }
                case "UNKNOWN":
                case "UNTYPED":
                default:
                    {
                        UnknownSnapshot.Builder unknownBuilder =
                                unknownMap.computeIfAbsent(
                                        sanitizedName,
                                        key ->
                                                UnknownSnapshot.builder()
                                                        .name(sanitizedName)
                                                        .help(finalMatchedRule.help));
                        unknownBuilder.dataPoint(
                                UnknownSnapshot.UnknownDataPointSnapshot.builder()
                                        .labels(labels)
                                        .value(value.doubleValue())
                                        .build());

                        break;
                    }
            }

When the code to build the MetricSnapshots is executed...

        for (CounterSnapshot.Builder counter : receiver.countersMap.values()) {
            result.metricSnapshot(counter.build());
        }

        for (GaugeSnapshot.Builder gauge : receiver.gaugeMap.values()) {
            result.metricSnapshot(gauge.build());
        }

        for (UnknownSnapshot.Builder unknown : receiver.unknownMap.values()) {
            result.metricSnapshot(unknown.build());
        }

... we experience the DuplicateLabelsException

Initial Approach

My initial approach was to keep a map of "sanitized" name to ...SnapShot.Builder and a separate map of "sanitized" name to a list of ...DataPointSnapshot.Builders.

This doesn't work because the ...DataPointSnapshot.Builders have no methods to modify their contents.

        Map<String, CounterSnapshot.Builder> snapshotBuilderMap = new HashMap<>();
        Map<String, List<CounterSnapshot.CounterDataPointSnapshot.Builder>> snapshotDataPointBuilderMap = new HashMap<>();

        for (Map.Entry<String, CounterSnapshot.Builder> entry : snapshotBuilderMap.entrySet()) {
            String key = entry.getKey();
            CounterSnapshot.Builder builder = entry.getValue();
            List<CounterSnapshot.CounterDataPointSnapshot.Builder> list = snapshotDataPointBuilderMap.get(key);

            for (CounterSnapshot.CounterDataPointSnapshot.Builder dataPointBuilder : list) {
                try {
                    builder.dataPoint(dataPointBuilder.build());
                } catch (DuplicateLabelsException e) {
                    // CounterSnapshot.CounterDataPointSnapshot.Builder doesn't
                    // have a method to add a label to the existing labels
                    // and doesn't have a method to get the current labels
                }
            }
        }

I have been trying to find a solution without requiring a client_java change/release.

from jmx_exporter.

fstab commented on July 4, 2024

Thanks a lot for the example, that was helpful.

I implemented a proposal how to solve this in #960. What do you think?

from jmx_exporter.

dhoard commented on July 4, 2024

Resolve via #960. (post 1.0.0 release)

from jmx_exporter.

Collection fails for Kafka using release 1.0.0 about jmx_exporter HOT 14 CLOSED

Comments (14)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent