
Comments (18)

ahus1 avatar ahus1 commented on June 8, 2024

Thanks for taking on this task. I've put it on the roadmap, but I welcome any PR to work on this.

Before you start a PR, could we first agree on the new names for the metrics? I tried to improve the hystrix_command_* metrics in the previous release - let me know what I missed there.

I haven't yet put any thought or effort into the hystrix_thread_* metrics.

For the sake of compatibility: please only add new metrics and mark the old ones deprecated. All deprecated metrics will be removed in the first 4.x release (but that will require all metrics to comply with prometheus metrics naming and to be stable)

from prometheus-hystrix.

benzvan avatar benzvan commented on June 8, 2024

OK... here's my proposal. I don't 100% understand all the metrics or all of Prometheus's requirements, so it's partially a stab in the dark.

hystrix_command_error_total
hystrix_command_event_total

becomes

hystrix_command_total{type="error"}
hystrix_command_total{type="total"}

hystrix_command_latency_execute_seconds_bucket
hystrix_command_latency_execute_seconds_count
hystrix_command_latency_execute_seconds_sum

becomes

hystrix_command_latency_execute_seconds{type="bucket"}
hystrix_command_latency_execute_seconds{type="count"}
hystrix_command_latency_execute_seconds{type="sum"}

hystrix_command_latency_total_seconds_bucket
hystrix_command_latency_total_seconds_count
hystrix_command_latency_total_seconds_sum

becomes

hystrix_command_latency_seconds_total{type="bucket"}
hystrix_command_latency_seconds_total{type="count"}
hystrix_command_latency_seconds_total{type="sum"}

hystrix_thread_pool_completed_task_count
hystrix_thread_pool_count_threads_executed
hystrix_thread_pool_thread_active_count
hystrix_thread_pool_total_task_count

becomes

hystrix_thread_count{thread_type="completed_task"}
hystrix_thread_count{thread_type="threads_executed"}
hystrix_thread_count{thread_type="thread_active"}
hystrix_thread_count{thread_type="total_task"}

hystrix_thread_pool_rolling_count_threads_executed
hystrix_thread_pool_rolling_max_active_threads

becomes

hystrix_thread_rolling_count{thread_type="threads_executed"}
hystrix_thread_rolling_max{thread_type="active_threads"}

hystrix_thread_pool_largest_pool_size
hystrix_thread_pool_queue_size

becomes

hystrix_thread_pool_size{type="largest_pool"}
hystrix_thread_pool_size{type="queue"}

This brings up the question of whether these should be combined better and what exactly the difference is between a count and a sum:

hystrix_command_latency_execute_seconds{type="bucket"}
hystrix_command_latency_execute_seconds{type="count"}
hystrix_command_latency_execute_seconds{type ="sum"}
hystrix_command_latency_seconds_total{type="bucket"}
hystrix_command_latency_seconds_total{type="count"}
hystrix_command_latency_seconds_total{type="sum"}


ahus1 avatar ahus1 commented on June 8, 2024

The best practices for naming are listed here: https://prometheus.io/docs/practices/naming/

Regarding hystrix_command_error_total/hystrix_command_event_total vs. hystrix_command_total{type="error"}/hystrix_command_total{type="total"}

The rule to apply is "[E]ither the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics." -> you can't sum up errors and totals. Therefore I keep them as separate metrics.

Several metrics with bucket/sum/count at the end

These are the standard metric names of a histogram, and they are coded into the Prometheus client (in fact, histograms in all client languages have them). Therefore I keep them as they are. "sum" is the time used for all executions; "count" is the number of executions. You can use the two numbers to calculate the average execution time.
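That last calculation is typically done at query time; a hedged PromQL sketch of the average execution latency (the metric names follow the ones discussed above, and the 5-minute window is an arbitrary example):

```promql
rate(hystrix_command_latency_execute_seconds_sum[5m])
  / rate(hystrix_command_latency_execute_seconds_count[5m])
```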

Several hystrix_thread_pool_... metrics

The rule of sum()/avg() applies here as well. I doubt that we can use labels here. But the names can certainly be improved. I love counters a lot, but not everything can be captured by a counter. The all-time-max or even rolling-max values are unfamiliar in Prometheus metrics, but very helpful in Hystrix world, as peaks would otherwise be missed.

Do you want to give it another go for the hystrix_thread_pool_... metrics?


benzvan avatar benzvan commented on June 8, 2024

I can take another go at it. Are there any metrics you think could be combined using labels? Our Prometheus admins are complaining that the bucket metrics alone are creating too many series, so maybe I'm going about this the wrong way.

Between hystrix_command_latency_execute_seconds_bucket{} and hystrix_command_latency_total_seconds_bucket{} we've generated 500k+ time series in < 12 hours

For these metrics:

hystrix_thread_pool_completed_task_count
hystrix_thread_pool_total_task_count

hystrix_thread_pool_thread_active_count
hystrix_thread_pool_count_threads_executed

The bare minimum to comply with the standards would be simply to make sure count is at the end for each metric (only one affected):

hystrix_thread_pool_thread_active_count


ahus1 avatar ahus1 commented on June 8, 2024

My understanding of Prometheus is that it doesn't matter if you use labels or not, as the metric name plus all label values will define the time series.

Example: if you have metric_a and metric_b or metric{label="a"} and metric{label="b"}, this will be two time series in the sense of Prometheus.

So conversion between the name and labels will not reduce the number of time series.

I'll do a sample calculation here (your case will most likely be different):

Today a hystrix_command_latency_total_seconds_bucket will have 31 time series per command. As there is also hystrix_command_latency_execute_seconds_bucket, this will give you 62 time series per command for both of them.

If you have 100 commands in your application, you will get 6200 time series for the buckets.

If you have 20 instances of your application running in parallel, you will get 124000 time series.

If you are re-deploying your application 4 times within 12 hours for example on a kubernetes cluster, you will get a different instance name every time. Each of them will provide its metrics as a different time series: 4x 124000 = 496000 time series.
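The multiplication above can be reproduced in a short sketch (the numbers are this example's assumptions, not universal constants):

```python
# Sample calculation: time series produced by the two latency histograms
# under the assumptions of the example above.
buckets_per_histogram = 31  # time series per command for one ..._bucket metric
histogram_metrics = 2       # ..._execute_seconds_bucket and ..._total_seconds_bucket
commands = 100              # Hystrix commands in the application
instances = 20              # application instances running in parallel
redeploys = 4               # re-deployments in 12 hours, each with a new instance name

series = buckets_per_histogram * histogram_metrics * commands * instances * redeploys
print(series)  # 496000
```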

To reduce the number of time series there are different options using all the factors above - not all of them might be options, and they have drawbacks:

  • drop some metrics by adjusting the Prometheus Scraping configuration using rewrite rules (you can choose to keep just hystrix_command_latency_total_seconds_bucket and drop hystrix_command_latency_execute_seconds_bucket for example)
  • reduce the number of commands in your application
  • use fewer/other buckets (but this will reduce the accuracy) - this is not supported by this library yet
  • ...

You can also ask for consulting services. I know some of the maintainers of Prometheus offer support. The company I work for offers consulting service as well.


civik avatar civik commented on June 8, 2024

Consider just using the default buckets for the histogram, or at least fewer static ones, and not defining the exponential buckets here. Not sure why anyone would need 32 buckets of precision. Just use the default quantiles, and leverage Prom's linear prediction functions for an approximation of that level of precision without storing an insane number of time series.

Plus, let's get the metrics typed properly. Currently for me they are showing as 'untyped'.

Please keep in mind, we're talking webscale apps here: hundreds of containers scaling up and down frequently. Take your example of redeploying apps 4 times in 12 hours, and now imagine that with a Prom retention of 2+ weeks. That level of time-series generation is unsustainable.


benzvan avatar benzvan commented on June 8, 2024

@civik I think I need to do some research to figure out what the purpose of these weird histograms is. The numeric values seem constant, but could be based on a configuration value. On the other hand, I guarantee I don't have any configuration set to "2.5169093494697568" seconds.

In any case, it may make sense to just have a feature toggle for histograms.


benzvan avatar benzvan commented on June 8, 2024

@ahus1 So it looks like closing this issue is a matter of changing one metric name:

This
hystrix_thread_pool_count_threads_executed
Becomes
hystrix_thread_pool_threads_executed_count


benzvan avatar benzvan commented on June 8, 2024

@ahus1
Actually these two should be changed too

From

hystrix_thread_pool_rolling_count_threads_executed
hystrix_thread_pool_rolling_max_active_threads

To

hystrix_thread_pool_threads_executed_rolling_count
hystrix_thread_pool_active_threads_rolling_max


civik avatar civik commented on June 8, 2024

I would also add a switch guarding the exponential bucket definition, or just do away with it entirely. Looks like the class tries to provide a sane default: https://prometheus.io/client_java/io/prometheus/client/Histogram.html


ahus1 avatar ahus1 commented on June 8, 2024

I've pushed some code changes to make the buckets of the histogram configurable. Also the new default for the histograms of this publisher is the Histogram's default -- cc: @SimenB, @jtlisi

To use the previous exponential buckets, use the following:

HystrixPrometheusMetricsPublisher.builder().withExponentialBuckets().buildAndRegister()
HystrixPrometheusMetricsPublisher.builder().withExponentialBuckets(0.001, 1.31, 30).buildAndRegister()
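This may also explain bucket boundaries that look arbitrary at first glance: assuming the exponential buckets follow the common start * factor^i scheme (as in the Prometheus Java client's Histogram.Builder.exponentialBuckets), the boundaries for (0.001, 1.31, 30) can be reproduced with this sketch (Python is used purely for illustration):

```python
def exponential_buckets(start: float, factor: float, count: int) -> list[float]:
    """Bucket upper bounds start, start*factor, ..., start*factor**(count-1),
    mirroring the usual exponentialBuckets convention (the client adds a
    +Inf bucket implicitly on top of these)."""
    return [start * factor ** i for i in range(count)]

buckets = exponential_buckets(0.001, 1.31, 30)
print(len(buckets))           # 30
print(round(buckets[-1], 4))  # 2.5169 -- largest finite boundary, in seconds
```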

Please be aware that with the Histogram's default bucket size it is difficult to approximate the error of the calculated percentile. This is only possible with exponential buckets.

If you want to skip the histograms altogether, please use a Prometheus rewrite configuration. I will keep them in the default as Histograms are the only way you can aggregate latencies in a cluster.

The changes are already available in the snapshot Maven repository (see the main README at https://github.com/ahus1/prometheus-hystrix). Please give them a try and comment.


benzvan avatar benzvan commented on June 8, 2024

Sorry about the mess there, I forgot to deprecate the old metric names.


benzvan avatar benzvan commented on June 8, 2024

Local results:

  1. Without specifying buckets, it creates 15 default buckets.
  2. Specifying buckets with HystrixPrometheusMetricsPublisher.builder().withExponentialBuckets(0.0,1.31,6).buildAndRegister() results in the following error and no buckets, but I do get all the rest of the hystrix metrics:
     java.lang.IllegalStateException: Histogram buckets must be in increasing order: 0.0 >= 0.0
     Which makes sense, since no bucket boundary should be less than 0.0.
  3. Specifying buckets with HystrixPrometheusMetricsPublisher.builder().withExponentialBuckets(0.001,1.31,6).buildAndRegister() results in 7 total exponential buckets, which is nice.

I approve these changes.


ahus1 avatar ahus1 commented on June 8, 2024

@benzvan - thanks for the PR. I've squashed the changes and merged them.

There is an option to exclude all deprecated metrics; I extended this to HystrixPrometheusMetricsPublisherThreadPool as well.

I also changed "threads_executed_count" to "threads_executed_total", as the Prometheus Naming conventions state: "_total (for a unit-less accumulating count)."

Looking at the metrics again, could we remove rolling_threads_executed_count, as the threads_executed_total allows this value to be calculated by Prometheus rate() function?
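If the rolling metric were dropped, the same view could be reconstructed at query time; a hedged PromQL sketch using increase() over the counter (the 1-minute window is an arbitrary example):

```promql
increase(hystrix_thread_pool_threads_executed_total[1m])
```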


benzvan avatar benzvan commented on June 8, 2024

Yes, that makes sense. Should we also deprecate rolling_max_active?


ahus1 avatar ahus1 commented on June 8, 2024

rolling_max_active is useful, as thread_active_count only gives the instantaneous number of active threads. You can scrape it as often as you want, but it will not provide information about spikes between scrapes. rolling_max_active will indicate whether there has been a spike, as long as you scrape at least once within the rolling interval.

So I'll keep it for now.


ahus1 avatar ahus1 commented on June 8, 2024

This has been released as part of 3.3.0 and should be available on maven central shortly.


benzvan avatar benzvan commented on June 8, 2024

Awesome! Glad to be involved.

