
Comments (18)

ibmmqmet commented on May 26, 2024

The only thing I've seen that got close to this was when the subscription queue got full. There might be a warning in the output about the maxdepth on the model queue used for subscriptions, which depends on the number of queues being monitored. In 30s I'd normally expect 2 or 3 of the qmgr metric publications, with the most recent one always being returned.

fnuzzi commented on May 26, 2024

Hi Mark. I have exactly the same problem. Not all metrics are available all the time, only a subset (e.g. ibmmq_queue_depth is often missing). Prometheus graphs show many large holes where not all stats are returned. I have tried to investigate the issue and indeed the result returned by the scrape seems to be missing many metrics.
Mine are 9.1.0.4 queue managers running on RHEL7 with a large number of queues and running channels. I can see more than 7000 subscriptions of type "$SYS/MQ/INFO/QMGR/QMH1.PDT/Monitor/STATQ/BRIDGE.xxx/GET", etc.
I have set MonitorPublishHeartBeat to 30 seconds.
Watching the qdepth of the Prometheus subscription queue (default 60k), it seems the "right" number of messages (~7000) is generated at the right interval.
Something which might be important:
I have the same MQ Prometheus exporter running on a "clone" 9.1.0.2 queue manager (same number of queues but no traffic, hence no running channels) which is working fine.
I have tried compiling and using the latest mq-metric-samples code from the master repo but the result is the same.

ibmmqmet commented on May 26, 2024

What frequency are you using for Prometheus collection? Also 30s? If so, I would not be surprised to find that having identical timings means that sometimes the data obtained via pub/sub (as opposed to the explicitly gathered STATUS polling) is not available exactly when it's being requested. The time taken to do the actual collection in the exporter process may also be a factor, and it will be longer if you are collecting lots of channel status too.

I saw something similar happening when I was connecting as a client to a very remote/slow qmgr - the time taken to collect the data was even longer than the Prometheus scrape_interval which led to interesting effects.

Grafana does have an option on its graphs to keep displaying the previous value if there's nothing for the most recent interval. That can at least help with cosmetics so no gap is shown.

Right now it's difficult/very slow for me to do a lot of debug scenarios for these environments but I'll take another look to see if there's anything I can spot.

fnuzzi commented on May 26, 2024

Thanks Mark. I had already noticed that, which is why our QM publishes every 30 seconds and the Prometheus server scrapes every 60 seconds.
Focusing on queue statistics, the following are the stats which seem to be always present:
ibmmq_queue_attribute_max_depth
ibmmq_queue_attribute_usage
ibmmq_queue_input_handles
ibmmq_queue_oldest_message_age
ibmmq_queue_output_handles
ibmmq_queue_qtime_long
ibmmq_queue_qtime_short
ibmmq_queue_time_since_get
ibmmq_queue_time_since_put
so either we get the full metrics or just those.
I am trying to play with the configuration (e.g. not displaying channels, reducing the number of queues to monitor with filters, etc.) and will let you know of any findings.
Please also let me know if I can run any test that might help your investigation or if there is any "debug" switch I can turn on to produce more info.

ibmmqmet commented on May 26, 2024

Those queue metrics are the ones coming from DIS QSTATUS, rather than from the published resource stats, so that list looks like the publications have not been generated from the qmgr in the interval.

Do you have qmgr performance events enabled on the qmgr and modelq? I wonder if the subscriber's queue is getting filled and a Q_FULL event would show that. Attribute QDPHIEV(ENABLED) on SYSTEM.DEFAULT.MODEL.QUEUE and PERFMEV(ENABLED) on qmgr.
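
In MQSC terms that is roughly the following (assuming the exporter's subscription queue is built from SYSTEM.DEFAULT.MODEL.QUEUE; substitute whichever model queue the collector is configured to use, and optionally tune QDEPTHHI to set the threshold percentage):

 ALTER QMGR PERFMEV(ENABLED)
 ALTER QMODEL(SYSTEM.DEFAULT.MODEL.QUEUE) QDPHIEV(ENABLED)

Any resulting events would land on SYSTEM.ADMIN.PERFM.EVENT.QUEUE.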

One stupid question: Is the prometheus engine the only program pulling the stats? When I've been testing things, I've sometimes had a browser also pointed at port 9157 and then got confused when it picked up the stats instead of real prometheus.

One other question: Are you trapping stdout/stderr from the collector program to a log file? If so, is there an error "superfluous response.WriteHeader" getting logged? I seem to be getting that error from the prometheus client library when there's a large number of queues being monitored and it appears to stop metrics going to prometheus. (That's something I need to fix anyway but it's not easy to upgrade to the level of prometheus client that claims to solve this - it'll have to be done, just might take a while.)

fnuzzi commented on May 26, 2024

Thanks Mark.

DIS QSTATUS: I had also understood that from looking at your code.

subscriber's queue: I am sure that queue doesn't get full. The queue is monitored and is created from a "prometheus" model queue with sufficient space to hold publications for more than 15 minutes.

One stupid question: the prometheus engine is the only program pulling the stats, BUT there are two scrapers (for high availability) polling every minute (the QM publishes every 30 sec). I understand that the second of the two scrapers should get a response without qdepth, but the target DB is still the same. I will try to see if I can reduce the scrape processes to one.


I have one question for you about this: it would be good to get the "same" reply within a polling interval, like you do for displaying qstatus using the pollStatus variable in cmd/mq_prometheus/exporter.go.
Do you think it would be a good idea to store the content of mqmetric.ProcessPublications() and reply with the same data within the same polling interval?
That way the reply would be more "deterministic" and it would not matter who asks first and who asks second (or third).
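
For reference, the kind of interval gating I mean looks roughly like the sketch below (a simplified illustration of the pattern, not the actual exporter code; the names are approximations):

package main

import (
	"fmt"
	"time"
)

var lastPoll time.Time

// shouldPollStatus returns true at most once per pollInterval, so that an
// expensive collection step is not repeated on every scrape request.
func shouldPollStatus(pollInterval time.Duration) bool {
	now := time.Now()
	if now.Sub(lastPoll) < pollInterval {
		return false
	}
	lastPoll = now
	return true
}

func main() {
	if shouldPollStatus(10 * time.Second) {
		fmt.Println("collecting status on this scrape")
	}
}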

One other question: yes, there are not many but I do see this line:
http: superfluous response.WriteHeader call from github.com/ibm-messaging/mq-metric-samples/vendor/github.com/prometheus/client_golang/prometheus.(*responseWriterDelegator).WriteHeader (http.go:316)

Thanks again for taking the time to look into our issues. Your tool will be vital for us.
I also plan to issue a pull request for another feature not related to this issue. Is there any special process to follow or any special instructions to read?

ibmmqmet commented on May 26, 2024

One more thing I've now seen on my tiny machine while trying to monitor a huge number of queues: the processing was taking long enough to exceed the Prometheus scrape_timeout limit (default 10s), so the responses were not getting into the database. I remember seeing that before when I was trying to monitor a queue manager on the other side of the Atlantic from my desktop box ... even with a small number of queues it was taking far too long to collect all the status responses.

It's not a good idea to push the same counters because that would confuse aggregations like sum() for many of the metrics. I think all of the qstatus responses are absolute values, rather than deltas, so duplicating those is not so much of a problem if that is happening.

For additional PRs, see the README in the repo which links to the CLA (contributor license agreement). The template for PRs should give the boxes to tick.

fnuzzi commented on May 26, 2024

Just to let you know I am still investigating this issue and I plan to report the result of our investigation as soon as it is completed. I have now added some printf calls to your code to find out who is really calling the exporter, how frequently, and how many messages it processes at each scrape request.

fnuzzi commented on May 26, 2024

Just FYI, I have identified the issue on our side.
After adding a printf to show the remote IP address of the scraping applications, we realized there were two extra scraping pods in our cloud environment.
For the last two days the graphs have been normal and we have no evidence of the mq_prometheus exporter losing metrics.
For other people looking at this issue, what is very important:

  1. make sure you control the number of your scraping applications
  2. make sure your scraping applications write to the same DB
  3. make sure the scrape interval is longer than the QM publication interval (e.g. 1m vs 30s)
  4. make sure the scraping application waits long enough for the (compressed) metrics response to be returned by mq_prometheus. I have seen mq_prometheus jump to 100% of one CPU for several seconds (my QM has many thousands of queues). See the example scrape settings after this list.
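
As a sketch of points 3 and 4 in Prometheus terms, the relevant scrape settings look something like this (the job name and target are placeholders, not our real configuration; the values are just the ones discussed in this thread):

scrape_configs:
  - job_name: ibmmq
    scrape_interval: 60s   # longer than the 30s MonitorPublishHeartBeat
    scrape_timeout: 30s    # the default of 10s can be too short with thousands of queues
    static_configs:
      - targets: ['myhost:9157']   # the exporter's listener port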

ibmmqmet commented on May 26, 2024

Thanks for the update. Where did you get the remote IP address from? If it's readily accessible to the exporter code, I'd think about adding that as a debug print (but not if it means changes to the prometheus client library itself). I am going to put in a warning message about the collection time if it exceeds 10s just to make sure people are aware of the default scrape_timeout.

fnuzzi commented on May 26, 2024

I have added the following line:
fmt.Printf("Scraping host %s - ", req.RemoteAddr)
in the http.HandlerFunc() in http.go.
Indeed this would be quite useful information. Actually it was THE info which helped me identify the issue.
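
For anyone wanting the same information without touching the vendored client library, a wrapper around the metrics handler along these lines would work (a sketch only; the port and names are illustrative, not the collector's actual code):

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	metrics := promhttp.Handler()
	http.HandleFunc("/metrics", func(w http.ResponseWriter, req *http.Request) {
		// Print which scraper is calling - this is the detail that exposed
		// the two unexpected extra scraping pods.
		fmt.Printf("Scraping host %s\n", req.RemoteAddr)
		metrics.ServeHTTP(w, req)
	})
	log.Fatal(http.ListenAndServe(":9157", nil)) // default exporter port mentioned earlier
}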

On a side note: I have also posted a pull request but I haven't fulfilled the requirement about "You have added tests for any code changes". I do not know if this stops any automatic notification or the like. I would just like confirmation that you can see it.

ibmmqmet commented on May 26, 2024

Pity the address is not directly accessible without changes to external code. Oh well.

I'll comment on the PR on its thread - I can see it.

dhineshthiruppathi commented on May 26, 2024

@ibmmqmet

  1. Does this exporter take care of accounting and statistics messages from SYSTEM.ADMIN.ACCOUNTING.QUEUE and SYSTEM.ADMIN.STATISTICS.QUEUE in the queue manager? I see messages piling up on these queues even though the exporter is running.
 QMNAME(SAMPQMGR) ACCTINT(1800)
 ACCTMQI(ON) ACCTQ(ON)
 STATCHL(MEDIUM) STATINT(1800)
 STATMQI(ON) STATQ(ON)

Please let me know if the above properties have to be enabled to use this exporter.

  2. Is it only queue manager and queue metrics that are published on system topics for monitoring? I don't see channel metrics in the topics graph.

  3. I see lots of queue and queue manager metrics missing in our graphs. Our config has 2+ Prometheus instances scraping for high availability. The configured Prometheus scrape interval is 30s.
    I've tried setting MonitorPublishHeartBeat to 5 or 25 seconds, and also the pollInterval to 10s, but I still don't see the metrics.

Kindly suggest.

bismarck commented on May 26, 2024

@ibmmqmet thank you for sharing this exporter. I work with @dhineshthiruppathi and wanted to share some concerns regarding the nature of this exporter that seem to go against the Prometheus best practices for writing exporters.

The recommended approach to running Prometheus in a highly available setup is to run two or more identical Prometheus servers. As a result, two scrapes can happen at the same time. For this reason, the Prometheus docs recommend not using the direct instrumentation approach, and instead updating the metrics on each scrape (using MustNewConstMetric in your Collect() method).
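
For readers unfamiliar with that pattern, a minimal sketch of a collector built this way looks like the following (the metric name comes from this thread; the fetch function and its data are made up for illustration):

package main

import "github.com/prometheus/client_golang/prometheus"

// mqCollector demonstrates the "const metric" pattern: values are produced
// fresh inside Collect() on every scrape, rather than being stored in gauges
// that the exporter updates on its own timer.
type mqCollector struct {
	depthDesc *prometheus.Desc
}

func newMQCollector() *mqCollector {
	return &mqCollector{
		depthDesc: prometheus.NewDesc(
			"ibmmq_queue_depth",          // metric family mentioned in this thread
			"Current depth of the queue", // help text
			[]string{"queue"}, nil),
	}
}

func (c *mqCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.depthDesc
}

func (c *mqCollector) Collect(ch chan<- prometheus.Metric) {
	// fetchQueueDepths is a hypothetical stand-in for the real MQ status
	// gathering and publication processing done at scrape time.
	for queue, depth := range fetchQueueDepths() {
		ch <- prometheus.MustNewConstMetric(c.depthDesc, prometheus.GaugeValue, depth, queue)
	}
}

func fetchQueueDepths() map[string]float64 {
	return map[string]float64{"APP.QUEUE.1": 42} // placeholder data
}

func main() {
	prometheus.MustRegister(newMQCollector())
	// HTTP serving of /metrics is omitted; see the promhttp wrapper sketch earlier in the thread.
}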

My other concern is regarding the use of pub/sub to gather metrics from MQ. The Prometheus docs state:

Metrics should only be pulled from the application when Prometheus scrapes them, exporters should not perform scrapes based on their own timers. That is, all scrapes should be synchronous.

I'm unsure if it is even possible to gather these metrics synchronously (via an MQ API). That same section of the docs goes on to state:

If a metric is particularly expensive to retrieve, i.e. takes more than a minute, it is acceptable to cache it. This should be noted in the HELP string.

Perhaps @fnuzzi's suggestion of storing the last publication is not such a bad idea?

fnuzzi commented on May 26, 2024

Since this is getting interesting, let's put more info here to make the picture as clear as possible for Mark (even if there might be 'nothing' he can do unless something is changed in the core MQ code).

The issue I am facing now, after my investigations and tuning, is that we run two Prometheus scrapers for high availability, as @bismarck explains above.
Despite the exporter producing its output regularly, there will always be one of the scraping applications hitting an empty queue and retrieving no metrics. We go through Thanos from our Grafana dashboard, which selects the 'best' of the two databases, but even that DB is just the one which has fewer 'holes' in it.
We would need to be able to 'join' the results of the two DBs, but Grafana does not support such a feature (maybe a future major release will).
On the other hand, MQ metrics are mainly gauges rather than counters, so I understand the caveat that storing the last publication would produce wrong results. There are definite advantages to producing metrics asynchronously at regular intervals, but this model simply doesn't fit Prometheus well.

The second big problem concerns the huge number of subscriptions created to get the metrics. Each of our QMs has thousands of queues, and one is used by many publish/subscribe applications as a hub. When trying to display topics for troubleshooting, they just 'disappear' among all the $SYS topics created for the metrics. Obviously you can filter those out, but it really seems like 'there should be a better way'.
Also, the 'put-on-topic' and 'published-to-subscriber' stats are global values, so I cannot distinguish how many put-on-topic operations are done because of metrics collection and how many are instead done by my subscriber applications (which is really what I want to monitor).

ibmmqmet commented on May 26, 2024

Trying to answer some of these recent comments ...

  • The exporter does not read SYSTEM.ADMIN.STATISTICS (or .ACCOUNTING) QUEUE. There are separate tools provided with the MQ product that can process events written there. For example, amqsevt -o json might be interesting.
  • The queue manager today only publishes information about itself and queues (and from 9.1.5 also about some application behaviour). It does not publish about other object types. The exporters use DISPLAY xxSTATUS to get that kind of information for channels etc.
  • The metrics are delivered to whichever Prometheus instance has requested the scrape. The metrics are only available once - delivering the same value twice would break any aggregation. If Prometheus provided something like a unique identifier during the scrape then it might be possible to make it work, but right now there's no way to distinguish things like a primary or secondary instance. I'd really hope that any HA solution includes a database merge option somehow; otherwise you might need to run two separate instances of the collector, one for each db.
  • The metrics are processed and delivered synchronously when Prometheus requests a scrape. All the available publications are collected and aggregated in the exporter, and the DIS xxSTATUS commands are all issued then. If there are no publications referring to a particular metric then we don't send an empty gauge; the gauge is simply not sent at all on that scrape.
  • The latest version of the code (v5.0.0) does have an option that reduces the number of subscriptions, at the cost of not collecting all of the metrics. See the queueSubscriptionSelector option. Setting it to "PUT,GET" will halve the number of subscriptions but you won't then see the number of OPEN calls made for a queue. That might be a reasonable tradeoff. It would be nice if the queue manager itself supported wildcard subscriptions for these resources but that's not in today's product so the multiple subs have to be made.
  • There's now a warning issued if the collection time during a scrape seems to be taking too long. But when the exporter is alongside the queue manager, even thousands of queues do not seem to extend the collection too much. It was different when I tried having the exporter and the queue manager on opposite sides of the Atlantic ... one collection would start before the previous had finished so I had to tune the scrape interval for that situation.
  • I understand about the global nature of the publication counters that come from the qmgr resource publications. But there are separate metrics for topics and subscriptions showing the number of messages sent/received, so you might be able to use those. I might look at adding a pseudo-metric that shows how many publications were read by the collector on each scrape so it's easier to remove those from a total.

ibmmqmet commented on May 26, 2024

v5.1.0 was released last week and has a new pseudo-metric, qmgr_exporter_publications, showing how many publications were consumed on each scrape. That deals with the final bullet in the previous comment.

dunchych commented on May 26, 2024

I used the metrics exporter configured as an MQ client, running in a separate container on OpenShift and monitoring MQ in a remote container, and - unlike with the server-side option - did not see the intermittent drop-off of the queue-depth metric. It worked just fine.
