rabbitmq / rabbitmq-management-agent
RabbitMQ Management Agent
Home Page: https://www.rabbitmq.com/
License: Other
See rabbitmq/rabbitmq-server#1159. This is the same issue as #34 but in a different scenario.
See this rabbitmq-users thread.
Most such options either cannot be sensibly serialised (e.g. their values are functions), or neither the management UI nor most users care about them appearing in GET /api/overview output.
So we should filter them out rather than trying to find a way to convert them to something that will serialise.
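The filtering idea above can be sketched outside Erlang. This is a minimal Python illustration, not the plugin's code: `serialisable_options` and the sample option names are hypothetical, and JSON stands in for whatever serialiser the API uses.

```python
import json


def serialisable_options(options):
    """Return only the options whose values can be serialised to JSON."""
    kept = {}
    for name, value in options.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            # e.g. the value is a function: drop it rather than convert it
            continue
        kept[name] = value
    return kept


# Hypothetical listener options; "verify_fun" holds a function value.
example = {"port": 15672, "verify_fun": lambda cert: True, "ciphers": ["a", "b"]}
filtered = serialisable_options(example)
```

Dropping the option keeps the overview output valid without inventing a textual representation for values that have none.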
I have read issue #90, which was fixed by increasing the interval to 10 minutes; the fix is available in version 3.7.24.
However, I have installed RabbitMQ 3.8.2 with Erlang version "Erlang OTP 22 (10.6)" and I am still facing the same issue. I have not deployed handle.exe from Sysinternals. Is there any way to limit the logging, or do I have to change something in the config file?
See this report. This is the same issue as in rabbitmq/rabbitmq-server#91, except in a different module.
The process should not stop in init/1; in fact, it should not attempt to retrieve the stats there. Waiting for a moment is perfectly fine.
Hi,
When RabbitMQ is slow to start up, it may log a lot of "Discarding message" errors:
2020-04-30 05:10:27.244 [error] emulator Discarding message {'$gen_cast',{force_event_refresh,#Ref<0.2714273141.3436707841.245624>}} from <0.344.0> to <0.438.0> in an old incarnation (2) of this node (3)
The cause of these messages is the following:
rabbitmq-management-agent/src/rabbit_mgmt_db_handler.erl
Lines 66 to 83 in 5bc5efd
Calling into this:
Basically, Mnesia still holds the old pids, which causes these messages.
Before restart these were the queues in the system:
[{amqqueue,{resource,<<"/">>,queue,<<"durable-queue">>},
true,false,none,[],<0.438.0>,[],[],[],undefined,undefined,
[],[],live,0,[],<<"/">>,
#{user => <<"guest">>}},
{amqqueue,{resource,<<"/">>,queue,<<"test_queue">>},
true,false,none,[],<0.441.0>,[],[],[],undefined,undefined,
[],[],live,0,[],<<"/">>,
#{user => <<"my_user">>}}]
After restart we get these errors:
2020-04-30 05:10:27.244 [info] <0.344.0> Management plugin: using rates mode 'basic'
2020-04-30 05:10:27.244 [info] <0.344.0> Running boot step recovery defined by app rabbit
2020-04-30 05:10:27.244 [error] emulator Discarding message {'$gen_cast',{force_event_refresh,#Ref<0.2714273141.3436707841.245624>}} from <0.344.0> to <0.438.0> in an old incarnation (2) of this node (3)
2020-04-30 05:10:27.244 [error] emulator Discarding message {'$gen_cast',{force_event_refresh,#Ref<0.2714273141.3436707841.245624>}} from <0.344.0> to <0.441.0> in an old incarnation (2) of this node (3)
You can see that if this runs before the queues have been properly restarted in a single-node installation, it may generate as many messages as there are queues. I am not sure why, but in some cases we see this in a clustered setup as well, I guess after a full restart.
This can cause high memory usage and slow down RabbitMQ startup. I think we hit this in the past, when there were several thousand of these messages, but only now have we managed to track it down.
I am not sure whether this is a race condition between the management agent starting up and the queues starting, or some other error, but it is basically 100% reproducible on a Windows installation with 2 cores and 2 queues, in a VM on my machine.
I am not sure whether there is a way to check that a pid belongs to the current incarnation, or whether just checking that the pid is alive is enough, but it would be good to avoid these messages.
I am also not sure whether the fact that the messages are actually lost in these cases causes any errors.
Thanks
Will plugin configuration be adapted to the new format?
Expose the value returned by: erlang:system_info(scheduler_bind_type)
A sudden stop of tens of thousands of consumers leaves orphan consumer stats in the DB.
This is caused by the orelse condition in https://github.com/rabbitmq/rabbitmq-management-agent/blob/master/src/rabbit_mgmt_gc.erl#L198, which should be an andalso. With that change, stats will be kept only if both the process and the entity still exist.
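The difference between the two conditions is easiest to see as a truth table. The Python sketch below is only an illustration of the logic described here, not the code in rabbit_mgmt_gc.erl; both function names are made up.

```python
def keep_stats_with_or(process_alive, entity_exists):
    # Mirrors the reported "orelse": one surviving side is enough to keep stats.
    return process_alive or entity_exists


def keep_stats_with_and(process_alive, entity_exists):
    # Mirrors the proposed "andalso": both sides must still exist.
    return process_alive and entity_exists


# A consumer whose channel process died while its queue still exists.
orphan_kept_before = keep_stats_with_or(False, True)
orphan_kept_after = keep_stats_with_and(False, True)
```

With `or`, the entry for a dead consumer process survives for as long as the queue does, which matches the orphan stats described in the report.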
It appears every ~5 seconds, i.e. ~17,000 times (about 1.7 MB) per day, and is pretty useless. It also makes analyzing logs harder.
It is not always possible to install Sysinternals on a client's machine and, as you know, it can't be distributed with RabbitMQ (https://docs.microsoft.com/en-us/sysinternals/license-faq).
Is it possible to emit this error only once, the first time it occurs, and require a RabbitMQ restart after Sysinternals is installed?
* Sometimes, without any explainable reason, we get "Could not infer the number of file handles used: badarg" instead.
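The "only once" behaviour asked for above could look roughly like the following. This is a Python sketch under assumptions: `make_warn_once` and the key `handle_exe_missing` are hypothetical names, and the list `emitted` stands in for the real logger.

```python
def make_warn_once(emit):
    """Wrap emit so that each warning key is reported at most once."""
    seen = set()

    def warn_once(key, message):
        if key in seen:
            return False  # already reported since startup; stay quiet
        seen.add(key)
        emit(message)
        return True

    return warn_once


emitted = []
warn_once = make_warn_once(emitted.append)

# Simulate three polling cycles that all hit the same problem.
for _ in range(3):
    warn_once("handle_exe_missing", "Could not find handle.exe")
```

Because the set lives in memory, the warning would show up again after a restart, which fits the restart requirement suggested in the question.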
I have the following set up:
Rabbitmq cluster version 3.6.9 with 3 nodes in AWS EC2
The main client of the RabbitMQ cluster is Sensu, with the following: vhost '/sensu', user 'sensu', ha-policy '"^(results$|keepalives$)" '{"ha-mode":"all", "ha-sync-mode":"automatic"}'
Sensu is able to use this cluster for its intended purpose. However, I do see the following error in the SASL log every now and then:
=CRASH REPORT==== 28-Apr-2017::19:34:47 ===
crasher:
initial call: rabbit_mgmt_metrics_collector:init/1
pid: <0.27508.15>
registered_name: connection_coarse_metrics_metrics_collector
exception exit: {badarith,
[{rabbit_mgmt_metrics_collector,sum_entry,2,
[{file,"src/rabbit_mgmt_metrics_collector.erl"},
{line,550}]},
{rabbit_mgmt_metrics_collector,
'-insert_entry_op/4-fun-0-',2,
[{file,"src/rabbit_mgmt_metrics_collector.erl"},
{line,529}]},
{dict,update_bkt,4,[{file,"dict.erl"},{line,323}]},
{dict,on_bucket,3,[{file,"dict.erl"},{line,415}]},
{dict,update,4,[{file,"dict.erl"},{line,318}]},
{rabbit_mgmt_metrics_collector,insert_entry_op,4,
[{file,"src/rabbit_mgmt_metrics_collector.erl"},
{line,528}]},
{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},
{rabbit_mgmt_metrics_collector,aggregate_entry,4,
[{file,"src/rabbit_mgmt_metrics_collector.erl"},
{line,213}]}]}
in function gen_server:terminate/6 (gen_server.erl, line 744)
ancestors: [rabbit_mgmt_agent_sup,rabbit_mgmt_agent_sup_sup,<0.362.0>]
messages: []
links: [<0.364.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 2586
stack_size: 27
reductions: 1848
neighbours:
=SUPERVISOR REPORT==== 28-Apr-2017::19:34:47 ===
Supervisor: {local,rabbit_mgmt_agent_sup}
Context: child_terminated
Reason: {badarith,
[{rabbit_mgmt_metrics_collector,sum_entry,2,
[{file,"src/rabbit_mgmt_metrics_collector.erl"},
{line,550}]},
{rabbit_mgmt_metrics_collector,
'-insert_entry_op/4-fun-0-',2,
[{file,"src/rabbit_mgmt_metrics_collector.erl"},
{line,529}]},
{dict,update_bkt,4,[{file,"dict.erl"},{line,323}]},
{dict,on_bucket,3,[{file,"dict.erl"},{line,415}]},
{dict,update,4,[{file,"dict.erl"},{line,318}]},
{rabbit_mgmt_metrics_collector,insert_entry_op,4,
[{file,"src/rabbit_mgmt_metrics_collector.erl"},
{line,528}]},
{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},
{rabbit_mgmt_metrics_collector,aggregate_entry,4,
[{file,"src/rabbit_mgmt_metrics_collector.erl"},
{line,213}]}]}
Offender: [{pid,<0.27508.15>},
{name,connection_coarse_metrics_metrics_collector},
{mfargs,
{rabbit_mgmt_metrics_collector,start_link,
[connection_coarse_metrics]}},
{restart_type,permanent},
{shutdown,30000},
{child_type,worker}]
I'm no expert in Erlang, but based on this piece of information it looks like it has something to do with this line:
https://github.com/rabbitmq/rabbitmq-management-agent/blob/master/src/rabbit_mgmt_metrics_collector.erl#L550
My guess is that the caller somehow passed a non-numeric value to the sum_entry function. While this doesn't seem to cause any issue for our RabbitMQ cluster, it is rather misleading to see so many errors like this in the log file.
Let me know if you need any information.
See rabbitmq/discussions#10 for the background. This is fairly common to see specifically on Windows. Whatever the root cause may be, the function that evaluates the metric should simply return 0 if it cannot proceed.
The reset of the stats db should also reset the old_aggr_stats stored in the collector processes.
Management plugin config:
{rabbitmq_management,
 [{listener, [{port, 15672},
              {ssl, false},
              {ssl_opts, [{cacertfile, "/etc/rabbitmq/ssl/cacert.pem"},
                          {certfile, "/etc/rabbitmq/ssl/server.pem"},
                          {keyfile, "/etc/rabbitmq/ssl/server.key"}]}]},
  {rates_mode, none},
  {sample_retention_policies,
   [{global, [{60, 5}]},
    {basic, [{60, 20}]},
    {detailed, [{10, 5}]}]}
 ]},
Log error:
=ERROR REPORT==== 30-Mar-2017::00:24:14 ===
** Generic server rabbit_mgmt_gc terminating
** Last message in was start_gc
** When Server state == {state,#Ref<0.0.1.100845>,120000}
** Reason for termination ==
** {badarg,[{erlang,node,[none],[]},
{rabbit_misc,is_process_alive,1,
[{file,"src/rabbit_misc.erl"},{line,872}]},
{rabbit_mgmt_gc,gc_process,3,
[{file,"src/rabbit_mgmt_gc.erl"},{line,136}]},
{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},
{ets,do_foldl,4,[{file,"ets.erl"},{line,585}]},
{ets,foldl,3,[{file,"ets.erl"},{line,574}]},
{rabbit_mgmt_gc,handle_info,2,
[{file,"src/rabbit_mgmt_gc.erl"},{line,44}]},
{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,601}]}]}
When I try to get an individual queue via the API "/api/queues/vhost/name", I get 404.
When I try to delete an individual queue via the API "/api/queues/vhost/name", I get 405.
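The report does not say which vhost or queue name was used, so the cause is unknown. One thing worth ruling out is path encoding: in the RabbitMQ HTTP API the vhost and queue name are separate path segments, and the default vhost / has to be written as %2F. A small Python sketch of building such a path follows; `queue_path` is a hypothetical helper, not part of any client library.

```python
from urllib.parse import quote


def queue_path(vhost, name):
    """Build /api/queues/<vhost>/<name> with both segments percent-encoded."""
    # safe="" makes quote() encode "/" too, so it cannot split the path.
    return "/api/queues/{}/{}".format(quote(vhost, safe=""), quote(name, safe=""))


default_vhost_path = queue_path("/", "test_queue")
named_vhost_path = queue_path("/sensu", "results")
```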
Reported on rabbitmq-users.
Running RabbitMQ version 3.7.8 on Windows
Got this message in log:
2018-09-24 01:24:21.086 [warning] <0.826.0> Could not parse handle.exe output: "\r\nNthandle v4.11 - Handle viewer\r\nCopyright (C) 1997-2017 Mark Russinovich\r\nSysinternals - www.sysinternals.com\r\n\r\nHandle type summary:\r\r\n ALPC Port : 1\r\r\n Desktop : 1\r\r\n Directory : 2\r\r\n EtwRegistration : 57\r\r\n Event : 181\r\r\n IoCompletion : 4\r\r\n IRTimer : 6\r\r\n Key : 116\r\r\n Key : 7\r\r\n Process : 4\r\r\n Section : 1\r\r\n Semaphore : 2\r\r\n Thread : 77\r\r\n TpWorkerFactory : 3\r\r\n WaitCompletionPacket: 7\r\r\n WindowStation : 2\r\r\nTotal handles: 471\r\r\n"
Handle.exe version 4.11 was installed, as shown in the log.
The problem, I suppose, is that its output differs slightly from that of versions prior to 4.0.
It starts with "Nthandle", whereas old versions start with "Handle", and the RabbitMQ agent looks for exactly "Handle" in the output in rabbit_mgmt_external_stats.erl.
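One way to avoid depending on the banner is to look only for the "Total handles" line, which appears in the output quoted above. The Python sketch below illustrates that idea and is not the logic in rabbit_mgmt_external_stats.erl; `total_handles` is a hypothetical name, and falling back to 0 on unparseable output is an assumption.

```python
import re

# Match the summary line only; the banner ("Handle" or "Nthandle") is ignored.
_TOTAL_RE = re.compile(r"Total handles:\s*(\d+)")


def total_handles(output):
    """Return the handle count from handle.exe output, or 0 if it is absent."""
    match = _TOTAL_RE.search(output)
    return int(match.group(1)) if match else 0


sample = (
    "\r\nNthandle v4.11 - Handle viewer\r\n"
    "Handle type summary:\r\r\n  Event : 181\r\r\n"
    "Total handles: 471\r\r\n"
)
count = total_handles(sample)
```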
This error seems to be linked to the sample retention policies initialization:
Config:
{sample_retention_policies,
[{global, [{60, 5}, {3600, 60}]},
{basic, [{60, 20}, {3600, 360}]},
{detailed, [{10, 5}]}]}
]},
Error in log:
"Error: {could_not_start,rabbitmq_management_agent,\n {{shutdown,\n {failed_to_start_child,rabbit_mgmt_agent_sup,\n {shutdown,\n {failed_to_start_child,\n channel_queue_exchange_metrics_metrics_collector,\n {function_clause,\n [{lists,min,[[]],[{file,"lists.erl"},{line,315}]},\n {rabbit_mgmt_metrics_collector,init,1,\n [{file,"src/rabbit_mgmt_metrics_collector.erl"},{line,73}]},\n {gen_server,init_it,6,[{file,"gen_server.erl"},{line,328}]},\n {proc_lib,init_p_do_apply,3,\n [{file,"proc_lib.erl"},{line,247}]}]}}}}},\n {rabbit_mgmt_agent_app,start,[normal,[]]}}}"
It used to work with 3.6.6.
Although most stats are properly cleaned up after a queue, channel, etc. is deleted, there might be some corner cases missing. It is even possible that there is a race condition between consuming stats from core tables and deleting objects, as we have seen some very small leaks of channel_created_stats in some deployments.
On top of the current gc strategy, we should implement a periodic gc that verifies that stats belong to a living object on the system. Otherwise, they should be deleted.
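A periodic sweep of this kind can be sketched as follows. This is a Python illustration only: the dictionary stands in for a stats table, and `sweep_orphans`, `is_alive` and the sample keys are hypothetical.

```python
def sweep_orphans(stats, is_alive):
    """Delete entries whose owner no longer exists and return the removed keys."""
    # Collect first, then delete, so the dict is not mutated while iterating.
    orphans = [key for key in stats if not is_alive(key)]
    for key in orphans:
        del stats[key]
    return orphans


# Hypothetical table keyed by channel id; only "ch-1" is still alive.
table = {"ch-1": {"messages": 3}, "ch-2": {"messages": 7}}
removed = sweep_orphans(table, lambda key: key == "ch-1")
```

Running such a sweep on a timer would complement event-driven cleanup: anything a missed or late delete event leaves behind is picked up on the next pass.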
See this thread: https://groups.google.com/forum/#!topic/rabbitmq-users/HKMIwURSW4c
Relevant line: https://github.com/rabbitmq/rabbitmq-management-agent/blob/master/src/rabbit_mgmt_db_handler.erl#L64
RabbitMQ misc returns the Val, not {ok, Val}
See deadtrickster/prometheus_rabbitmq_exporter#45
Sometimes on node restart the queue message stats (ready, unack...) are reported as NaN in the UI, and do not recover over time. This is a race condition on the boot step introduced by the agent to force stats collection when collect_statistics is set to none in rabbit.
In this case, enabling the rabbitmq_management_agent plugin sets the parameter to fine. However, it seems that this can sometimes happen after queues are recovered, and then they never emit stats.
Forcing an event refresh will solve this race condition on node restart, as well as the lack of stats when the plugin is enabled with collect_statistics set to none, by forcing the collection to start immediately.