Comments (18)
Thanks for taking on this task. I've put it on the road map, but I welcome any PR to work on this.
Before you start a PR, could we first agree on the new names for the metrics? I tried to improve the hystrix_command_* metrics in the previous release - let me know what I missed there.
I haven't yet put any thought or effort into the hystrix_thread_* metrics.
For the sake of compatibility, please only add new metrics and mark the old ones deprecated. All deprecated metrics will be removed in the first 4.x release (but that will require all metrics to comply with Prometheus metric naming conventions and to be stable).
from prometheus-hystrix.
OK...here's my proposal. I don't 100% understand all the metrics or all of Prometheus's requirements, so it's partially a stab in the dark.
hystrix_command_error_total
hystrix_command_event_total
becomes
hystrix_command_total{type="error"}
hystrix_command_total{type="total"}
hystrix_command_latency_execute_seconds_bucket
hystrix_command_latency_execute_seconds_count
hystrix_command_latency_execute_seconds_sum
becomes
hystrix_command_latency_execute_seconds{type="bucket"}
hystrix_command_latency_execute_seconds{type="count"}
hystrix_command_latency_execute_seconds{type="sum"}
hystrix_command_latency_total_seconds_bucket
hystrix_command_latency_total_seconds_count
hystrix_command_latency_total_seconds_sum
becomes
hystrix_command_latency_seconds_total{type="bucket"}
hystrix_command_latency_seconds_total{type="count"}
hystrix_command_latency_seconds_total{type="sum"}
hystrix_thread_pool_completed_task_count
hystrix_thread_pool_count_threads_executed
hystrix_thread_pool_thread_active_count
hystrix_thread_pool_total_task_count
becomes
hystrix_thread_count{thread_type="completed_task"}
hystrix_thread_count{thread_type="threads_executed"}
hystrix_thread_count{thread_type="thread_active"}
hystrix_thread_count{thread_type="total_task"}
hystrix_thread_pool_rolling_count_threads_executed
hystrix_thread_pool_rolling_max_active_threads
becomes
hystrix_thread_rolling_count{thread_type="threads_executed"}
hystrix_thread_rolling_max{thread_type="active_threads"}
hystrix_thread_pool_largest_pool_size
hystrix_thread_pool_queue_size
becomes
hystrix_thread_pool_size{type="largest_pool"}
hystrix_thread_pool_size{type="queue"}
This brings up the question of whether these should be combined better and what exactly the difference is between a count and a sum:
hystrix_command_latency_execute_seconds{type="bucket"}
hystrix_command_latency_execute_seconds{type="count"}
hystrix_command_latency_execute_seconds{type="sum"}
hystrix_command_latency_seconds_total{type="bucket"}
hystrix_command_latency_seconds_total{type="count"}
hystrix_command_latency_seconds_total{type="sum"}
The best practices for naming are listed here: https://prometheus.io/docs/practices/naming/
Regarding hystrix_command_error_total/hystrix_command_event_total vs. hystrix_command_total{type="error"}/hystrix_command_total{type="total"}
The rule to apply is "[E]ither the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics." -> you can't sum up errors and totals. Therefore I keep them as separate metrics.
Several Metrics with bucket/sum/count at the end
These are the standard metric names of a histogram, and they are built into the Prometheus client (in fact, histograms in all language clients have them). Therefore I keep them as they are. "sum" is the total time used for all executions; "count" is the number of executions. You can use the two numbers to calculate the average execution time.
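A histogram's "_sum" and "_count" can indeed be combined into an average; here is a minimal Java sketch (class name and sample values are hypothetical, not part of the library):

```java
public class HistogramAverage {
    // _sum = total seconds spent across all observations,
    // _count = number of observations; their ratio is the average latency.
    public static double average(double sumSeconds, double count) {
        return count == 0 ? 0.0 : sumSeconds / count;
    }

    public static void main(String[] args) {
        // hypothetical scrape: 12.5s spent across 500 executions
        System.out.println(average(12.5, 500)); // prints 0.025
    }
}
```

In PromQL the same idea is expressed as rate(..._sum[5m]) / rate(..._count[5m]).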
Several hystrix_thread_pool_... metrics
The rule of sum()/avg() applies here as well. I doubt that we can use labels here. But the names can certainly be improved. I love counters a lot, but not everything can be captured by a counter. The all-time-max or even rolling-max values are unfamiliar in Prometheus metrics, but very helpful in Hystrix world, as peaks would otherwise be missed.
Do you want to give it another go for the hystrix_thread_pool_... metrics?
I can take another go at it. Are there any metrics you think could be combined using labels? Our Prometheus admins are complaining that the bucket metrics alone are creating too many series, so maybe I'm going about this the wrong way.
Between hystrix_command_latency_execute_seconds_bucket{} and hystrix_command_latency_total_seconds_bucket{} we've generated 500k+ time series in < 12 hours
For these metrics:
hystrix_thread_pool_completed_task_count
hystrix_thread_pool_total_task_count
hystrix_thread_pool_thread_active_count
hystrix_thread_pool_count_threads_executed
The bare minimum to comply with the standards would be simply to make sure count is at the end for each metric (only one affected):
hystrix_thread_pool_count_threads_executed
becomes
hystrix_thread_pool_threads_executed_count
My understanding of Prometheus is that it doesn't matter if you use labels or not, as the metric name plus all label values will define the time series.
Example: if you have metric_a and metric_b or metric{label="a"} and metric{label="b"}, this will be two time series in the sense of Prometheus.
So conversion between the name and labels will not reduce the number of time series.
I'll do a sample calculation here (your case will most likely be different):
Today a hystrix_command_latency_total_seconds_bucket will have 31 time series per command. As there is also hystrix_command_latency_execute_seconds_bucket, this will give you 62 time series per command for both of them.
If you have 100 commands in your application, you will get 6200 time series for the buckets.
If you have 20 instances of your application running in parallel, you will get 124000 time series.
If you are re-deploying your application 4 times within 12 hours for example on a kubernetes cluster, you will get a different instance name every time. Each of them will provide its metrics as a different time series: 4x 124000 = 496000 time series.
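The multiplication above can be sketched directly (all numbers are the hypothetical ones from the sample calculation, not measured values):

```java
public class SeriesEstimate {
    // each factor multiplies the number of distinct time series
    public static long estimate(int bucketsPerHistogram, int histograms,
                                int commands, int instances, int deployments) {
        return (long) bucketsPerHistogram * histograms * commands
                * instances * deployments;
    }

    public static void main(String[] args) {
        // 31 series per histogram, 2 histograms (latency_total / latency_execute),
        // 100 commands, 20 instances, 4 redeployments with fresh instance names
        System.out.println(estimate(31, 2, 100, 20, 4)); // prints 496000
    }
}
```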
To reduce the number of time series there are different options using all the factors above - not all of them might be options, and they have drawbacks:
- drop some metrics in the Prometheus scrape configuration using metric relabeling rules (you can choose to keep just hystrix_command_latency_total_seconds_bucket and drop hystrix_command_latency_execute_seconds_bucket, for example)
- reduce the number of commands in your application
- use fewer/other buckets (but this will reduce the accuracy) - this is not supported by this library yet
- ...
You can also ask for consulting services. I know some of the maintainers of Prometheus offer support. The company I work for offers consulting service as well.
Consider just using the default buckets for the histogram, or at least fewer static ones, and not defining the exponential buckets here. Not sure why anyone would need 32 buckets of precision. Just use the defaults, and leverage Prom's linear interpolation for an approximation of that level of precision, without storing an insane number of time series.
Plus, let's get the metrics typed properly. Currently for me they are showing as 'untyped'.
Please keep in mind, we're talking web-scale apps here: hundreds of containers scaling up and down frequently. Take your example of redeploying apps 4 times in 12 hours, and now imagine that with a Prom retention of 2+ weeks. That level of time series generation is unsustainable.
@civik I think I need to do some research to figure out what the purpose of these weird histogram buckets is. The numeric values seem constant, but could be based on a configuration value. On the other hand, I can guarantee I don't have any configuration set to "2.5169093494697568" seconds.
In any case, it may make sense to just have a feature toggle for histograms.
@ahus1 So it looks like closing this issue is a matter of changing one metric name:
This
hystrix_thread_pool_count_threads_executed
Becomes
hystrix_thread_pool_threads_executed_count
@ahus1
Actually these two should be changed too
From
hystrix_thread_pool_rolling_count_threads_executed
hystrix_thread_pool_rolling_max_active_threads
To
hystrix_thread_pool_threads_executed_rolling_count
hystrix_thread_pool_active_threads_rolling_max
I would also add a switch guarding the exponential bucket definition, or just do away with it entirely. Looks like the class tries to provide a sane default: https://prometheus.io/client_java/io/prometheus/client/Histogram.html
I've pushed some code changes to make the buckets of the histogram configurable. Also the new default for the histograms of this publisher is the Histogram's default -- cc: @SimenB, @jtlisi
To use the previous exponential buckets, use one of the following:
HystrixPrometheusMetricsPublisher.builder().withExponentialBuckets().buildAndRegister()
HystrixPrometheusMetricsPublisher.builder().withExponentialBuckets(0.001, 1.31, 30).buildAndRegister()
Please be aware that with the Histogram's default bucket size it is difficult to approximate the error of the calculated percentile. This is only possible with exponential buckets.
If you want to skip the histograms altogether, please use a Prometheus relabeling configuration. I will keep them enabled by default, as histograms are the only way you can aggregate latencies across a cluster.
The changes are already available in the snapshot Maven repository (see the main README at https://github.com/ahus1/prometheus-hystrix). Please give them a try and comment.
Sorry about the mess there, I forgot to deprecate the old metric names.
Local results:
1. Without specifying buckets, it creates 15 default buckets.
2. Specifying buckets with HystrixPrometheusMetricsPublisher.builder().withExponentialBuckets(0.0, 1.31, 6).buildAndRegister() results in the following error and no buckets, but I do get all the rest of the Hystrix metrics:
java.lang.IllegalStateException: Histogram buckets must be in increasing order: 0.0 >= 0.0
Which makes sense, since a start of 0.0 makes every exponential boundary 0.0.
3. Specifying buckets with HystrixPrometheusMetricsPublisher.builder().withExponentialBuckets(0.001, 1.31, 6).buildAndRegister() results in 7 total exponential buckets, which is nice.
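For reference, the exponential boundaries are presumably generated as start * factor^i, matching the Prometheus Java simpleclient's Histogram.Builder.exponentialBuckets(start, factor, count), with a +Inf bucket always appended on top. That would explain why count = 6 yields 7 buckets, and why start = 0.0 fails (every boundary becomes 0.0, so they are not increasing). It would also explain the otherwise mysterious "2.5169093494697568" mentioned earlier in this thread: 0.001 * 1.31^29 is roughly 2.5169, i.e. the top finite boundary of withExponentialBuckets(0.001, 1.31, 30). A sketch (class name is hypothetical):

```java
public class ExponentialBuckets {
    // boundaries[i] = start * factor^i for i = 0..count-1;
    // the Prometheus client then adds an implicit +Inf bucket on top.
    public static double[] boundaries(double start, double factor, int count) {
        double[] b = new double[count];
        for (int i = 0; i < count; i++) {
            b[i] = start * Math.pow(factor, i);
        }
        return b;
    }

    public static void main(String[] args) {
        for (double d : boundaries(0.001, 1.31, 6)) {
            System.out.println(d);
        }
        // 6 finite boundaries + implicit +Inf = 7 buckets in total
    }
}
```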
I approve these changes.
@benzvan - thanks for the PR. I've squashed the changes and merged them.
There is an option to exclude all deprecated metrics; I extended it to HystrixPrometheusMetricsPublisherThreadPool as well.
I also changed "threads_executed_count" to "threads_executed_total", as the Prometheus Naming conventions state: "_total (for a unit-less accumulating count)."
Looking at the metrics again, could we remove rolling_threads_executed_count, as the threads_executed_total allows this value to be calculated by Prometheus rate() function?
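What rate() derives from an accumulating counter like threads_executed_total is, roughly, the per-second increase between two samples. A simplified sketch, ignoring the counter-reset handling and range extrapolation that the real rate() performs (class name and sample values are hypothetical):

```java
public class CounterRate {
    // per-second increase between two counter samples;
    // real PromQL rate() also handles counter resets and extrapolation.
    public static double rate(double v1, long t1Millis, double v2, long t2Millis) {
        return (v2 - v1) / ((t2Millis - t1Millis) / 1000.0);
    }

    public static void main(String[] args) {
        // counter went from 1200 to 1500 over a 60s scrape interval
        System.out.println(rate(1200, 0, 1500, 60_000)); // prints 5.0
    }
}
```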
Yes, that makes sense. Should we also deprecate rolling_max_active?
rolling_max_active is useful, as thread_active_count only gives the instantaneous number of active threads. You can scrape it as often as you want, but it will not provide information about spikes between scrapes. rolling_max_active will give an indication of whether there has been a spike, as long as you scrape at least once within the rolling interval.
So I'll keep it for now.
This has been released as part of 3.3.0 and should be available on maven central shortly.
Awesome! Glad to be involved.