
rabbitmq-management-agent's Issues

Force Event Refresh causes lots of discarded messages

Hi,

When RabbitMQ is slow to start up, it may log many "Discarding message" errors.

2020-04-30 05:10:27.244 [error] emulator Discarding message {'$gen_cast',{force_event_refresh,#Ref<0.2714273141.3436707841.245624>}} from <0.344.0> to <0.438.0> in an old incarnation (2) of this node (3)

The cause of these messages is the following:

ensure_statistics_enabled() ->
    ForceStats = rates_mode() =/= none,
    handle_force_fine_statistics(),
    {ok, StatsLevel} = application:get_env(rabbit, collect_statistics),
    rabbit_log:info("Management plugin: using rates mode '~p'~n", [rates_mode()]),
    case {ForceStats, StatsLevel} of
        {true, fine} ->
            ok;
        {true, _} ->
            application:set_env(rabbit, collect_statistics, fine);
        {false, none} ->
            application:set_env(rabbit, collect_statistics, coarse);
        {_, _} ->
            ok
    end,
    ok = rabbit:force_event_refresh(erlang:make_ref()).
%%----------------------------------------------------------------------------

Calling into this:

https://github.com/rabbitmq/rabbitmq-server/blob/d9ac4fba3cf82552da9c8bbedc74cfb9f2cfafda/src/rabbit_amqqueue.erl#L1357-L1363

Basically, Mnesia still contains the old queue pids, which is what causes these messages.

Before restart these were the queues in the system:

[{amqqueue,{resource,<<"/">>,queue,<<"durable-queue">>},
           true,false,none,[],<0.438.0>,[],[],[],undefined,undefined,
           [],[],live,0,[],<<"/">>,
           #{user => <<"guest">>}},
 {amqqueue,{resource,<<"/">>,queue,<<"test_queue">>},
           true,false,none,[],<0.441.0>,[],[],[],undefined,undefined,
           [],[],live,0,[],<<"/">>,
           #{user => <<"my_user">>}}]

After restart we get these errors:

2020-04-30 05:10:27.244 [info] <0.344.0> Management plugin: using rates mode 'basic'
2020-04-30 05:10:27.244 [info] <0.344.0> Running boot step recovery defined by app rabbit
2020-04-30 05:10:27.244 [error] emulator Discarding message {'$gen_cast',{force_event_refresh,#Ref<0.2714273141.3436707841.245624>}} from <0.344.0> to <0.438.0> in an old incarnation (2) of this node (3)
2020-04-30 05:10:27.244 [error] emulator Discarding message {'$gen_cast',{force_event_refresh,#Ref<0.2714273141.3436707841.245624>}} from <0.344.0> to <0.441.0> in an old incarnation (2) of this node (3)

You can see that if this runs before the queues have been properly restarted on a single-node installation, it may generate as many of these messages as there are queues. I am not sure why, but in some cases we can see this in a clustered setup as well, presumably after a full cluster restart.

This can cause high memory usage; in the past we saw several thousand of these messages, which slowed down RabbitMQ's startup, and we have only now managed to track it down.

I am not sure whether this is a race condition between the management agent starting up and the queues starting, or some other error, but it is essentially 100% reproducible on a Windows installation with 2 cores and 2 queues, running in a VM on my machine.

I am not sure if there is a way to check whether a pid belongs to the current incarnation of the node, or whether simply checking that the pid is alive would be enough, but it would be good to avoid these messages.
I am also not sure whether it causes any actual problems that these messages are lost.
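
A minimal sketch of the kind of filtering that could avoid casting to stale pids, assuming a local single-node case; the helper name is hypothetical and this is not the actual rabbit_amqqueue implementation:

%% Hypothetical sketch only: skip pids that no longer refer to a live
%% process before casting force_event_refresh, so stale pids left in
%% Mnesia are ignored instead of producing "Discarding message" errors.
force_event_refresh(Ref, QPids) ->
    Live = [Pid || Pid <- QPids, is_alive_local_pid(Pid)],
    [gen_server2:cast(Pid, {force_event_refresh, Ref}) || Pid <- Live],
    ok.

%% Pids recovered from Mnesia can refer to an older incarnation of this
%% node; treat anything erlang:is_process_alive/1 rejects as not alive.
%% (This deliberately ignores the clustered case for brevity.)
is_alive_local_pid(Pid) when is_pid(Pid) ->
    try erlang:is_process_alive(Pid)
    catch error:badarg -> false
    end;
is_alive_local_pid(_) ->
    false.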

Thanks

"Could not find handle.exe, please install from sysinternals" message floods the log

It appears every ~5 seconds, which works out to ~17,000 entries and ~1.7 MB per day, and is pretty useless. It also makes analyzing the logs harder.

It is not always possible to install Sysinternals tools on a client's machine, and, as you know, they cannot be distributed with RabbitMQ (https://docs.microsoft.com/en-us/sysinternals/license-faq).

Would it be possible to emit this error only once, on the first occurrence, and require a restart of RabbitMQ after Sysinternals is installed?

* Sometimes, for no apparent reason, we get "Could not infer the number of file handles used: badarg" instead.
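
To illustrate the "emit only once" idea from the question above, here is a minimal sketch using persistent_term as a one-shot flag; the key and helper name are made up for this example and are not part of the plugin:

%% Hypothetical sketch of one-shot logging; handle_exe_warned is an ad-hoc
%% key, not part of the real plugin code.
warn_handle_exe_missing_once() ->
    case persistent_term:get({?MODULE, handle_exe_warned}, false) of
        false ->
            persistent_term:put({?MODULE, handle_exe_warned}, true),
            rabbit_log:warning(
              "Could not find handle.exe, please install from Sysinternals "
              "(this warning is only logged once)~n", []);
        true ->
            ok
    end.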

badarith in rabbit_mgmt_metrics_collector:sum_entry/2

I have the following set up:

  • RabbitMQ cluster version 3.6.9 with 3 nodes in AWS EC2

  • The main client of the cluster is Sensu, with vhost '/sensu', user 'sensu', and HA policy '^(results$|keepalives$)' '{"ha-mode":"all", "ha-sync-mode":"automatic"}'

So, Sensu is able to use this cluster for its intended purpose. However, I do see the following error in the SASL log every now and then:

=CRASH REPORT==== 28-Apr-2017::19:34:47 ===
  crasher:
    initial call: rabbit_mgmt_metrics_collector:init/1
    pid: <0.27508.15>
    registered_name: connection_coarse_metrics_metrics_collector
    exception exit: {badarith,
                        [{rabbit_mgmt_metrics_collector,sum_entry,2,
                             [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                              {line,550}]},
                         {rabbit_mgmt_metrics_collector,
                             '-insert_entry_op/4-fun-0-',2,
                             [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                              {line,529}]},
                         {dict,update_bkt,4,[{file,"dict.erl"},{line,323}]},
                         {dict,on_bucket,3,[{file,"dict.erl"},{line,415}]},
                         {dict,update,4,[{file,"dict.erl"},{line,318}]},
                         {rabbit_mgmt_metrics_collector,insert_entry_op,4,
                             [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                              {line,528}]},
                         {lists,foldl,3,[{file,"lists.erl"},{line,1248}]},
                         {rabbit_mgmt_metrics_collector,aggregate_entry,4,
                             [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                              {line,213}]}]}
      in function  gen_server:terminate/6 (gen_server.erl, line 744)
    ancestors: [rabbit_mgmt_agent_sup,rabbit_mgmt_agent_sup_sup,<0.362.0>]
    messages: []
    links: [<0.364.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 2586
    stack_size: 27
    reductions: 1848
  neighbours:

=SUPERVISOR REPORT==== 28-Apr-2017::19:34:47 ===
     Supervisor: {local,rabbit_mgmt_agent_sup}
     Context:    child_terminated
     Reason:     {badarith,
                     [{rabbit_mgmt_metrics_collector,sum_entry,2,
                          [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                           {line,550}]},
                      {rabbit_mgmt_metrics_collector,
                          '-insert_entry_op/4-fun-0-',2,
                          [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                           {line,529}]},
                      {dict,update_bkt,4,[{file,"dict.erl"},{line,323}]},
                      {dict,on_bucket,3,[{file,"dict.erl"},{line,415}]},
                      {dict,update,4,[{file,"dict.erl"},{line,318}]},
                      {rabbit_mgmt_metrics_collector,insert_entry_op,4,
                          [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                           {line,528}]},
                      {lists,foldl,3,[{file,"lists.erl"},{line,1248}]},
                      {rabbit_mgmt_metrics_collector,aggregate_entry,4,
                          [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                           {line,213}]}]}
     Offender:   [{pid,<0.27508.15>},
                  {name,connection_coarse_metrics_metrics_collector},
                  {mfargs,
                      {rabbit_mgmt_metrics_collector,start_link,
                          [connection_coarse_metrics]}},
                  {restart_type,permanent},
                  {shutdown,30000},
                  {child_type,worker}]

I'm no expert in Erlang, but based on this piece of information it looks like it has something to do with this line:
https://github.com/rabbitmq/rabbitmq-management-agent/blob/master/src/rabbit_mgmt_metrics_collector.erl#L550

My guess is that the caller somehow passed an invalid (non-numeric) value to the sum_entry function. While this doesn't seem to cause any issues for our RabbitMQ cluster, it is somewhat misleading to see so many errors like this in the log file.
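
For illustration, a rough sketch of what an element-wise sum_entry-style helper does and how a non-numeric slot would trigger badarith; this is not the plugin's actual code:

%% Rough illustration of an element-wise counter sum; not the plugin's real
%% sum_entry/2. If either tuple carries a non-numeric slot (for example the
%% atom 'undefined'), the '+' below raises badarith, matching the crash above.
sum_entry(New, Old) when tuple_size(New) =:= tuple_size(Old) ->
    list_to_tuple(
      lists:zipwith(fun(A, B) -> A + B end,
                    tuple_to_list(New), tuple_to_list(Old))).

%% Example of the failure mode:
%% sum_entry({1, 2}, {3, undefined}).  %% => error:badarith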

Let me know if you need any information.

Reset old_aggr_stats

The reset of the stats db should also reset the old_aggr_stats stored in the collector processes.
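
As a hedged sketch, a reset message handled by each collector process could clear both its ETS table and its old_aggr_stats; the #state{} fields below are assumptions, not the plugin's real record:

%% Hypothetical sketch only: the record fields are assumed for this example.
-record(state, {table, old_aggr_stats = dict:new()}).

handle_cast(reset, State = #state{table = Table}) ->
    ets:delete_all_objects(Table),
    {noreply, State#state{old_aggr_stats = dict:new()}};
handle_cast(_Msg, State) ->
    {noreply, State}.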

Metric GC process terminates with badarg in 3.6.9

Management plugin config:

{rabbitmq_management,
  [{listener, [{port,     15672},
                {ssl,      false},
                {ssl_opts, [{cacertfile, "/etc/rabbitmq/ssl/cacert.pem"},
                            {certfile,   "/etc/rabbitmq/ssl/server.pem"},
                            {keyfile,    "/etc/rabbitmq/ssl/server.key"}]}]},
   {rates_mode, none},
   {sample_retention_policies,
    [{global,   [{60, 5}]},
     {basic,    [{60, 20}]},
     {detailed, [{10, 5}]}]}
  ]},

Log error:

=ERROR REPORT==== 30-Mar-2017::00:24:14 ===
** Generic server rabbit_mgmt_gc terminating
** Last message in was start_gc
** When Server state == {state,#Ref<0.0.1.100845>,120000}
** Reason for termination ==
** {badarg,[{erlang,node,[none],[]},
            {rabbit_misc,is_process_alive,1,
                         [{file,"src/rabbit_misc.erl"},{line,872}]},
            {rabbit_mgmt_gc,gc_process,3,
                            [{file,"src/rabbit_mgmt_gc.erl"},{line,136}]},
            {lists,foldl,3,[{file,"lists.erl"},{line,1263}]},
            {ets,do_foldl,4,[{file,"ets.erl"},{line,585}]},
            {ets,foldl,3,[{file,"ets.erl"},{line,574}]},
            {rabbit_mgmt_gc,handle_info,2,
                            [{file,"src/rabbit_mgmt_gc.erl"},{line,44}]},
            {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,601}]}]}
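
The badarg comes from erlang:node/1 being handed the atom none instead of a pid inside rabbit_misc:is_process_alive/1. A minimal sketch of a defensive guard, with hypothetical function names:

%% Illustrative sketch only (function names are hypothetical): guard against
%% non-pid keys such as the atom 'none' before the liveness check, so that
%% erlang:node/1 inside rabbit_misc:is_process_alive/1 never receives a
%% non-pid and raises badarg.
maybe_mark_for_gc(Pid, DeadAcc) when is_pid(Pid) ->
    case rabbit_misc:is_process_alive(Pid) of
        true  -> DeadAcc;
        false -> [Pid | DeadAcc]   %% collect dead pids for later removal
    end;
maybe_mark_for_gc(_NotAPid, DeadAcc) ->
    DeadAcc.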

api "/api/queues/vhost/name" 404

When I try to get an individual queue via the API "/api/queues/vhost/name", I get a 404.
When I try to delete an individual queue via the same API, I get a 405.

The output of new versions of Handle.exe has been changed

Running RabbitMQ version 3.7.8 on Windows

Got this message in log:
2018-09-24 01:24:21.086 [warning] <0.826.0> Could not parse handle.exe output: "\r\nNthandle v4.11 - Handle viewer\r\nCopyright (C) 1997-2017 Mark Russinovich\r\nSysinternals - www.sysinternals.com\r\n\r\nHandle type summary:\r\r\n ALPC Port : 1\r\r\n Desktop : 1\r\r\n Directory : 2\r\r\n EtwRegistration : 57\r\r\n Event : 181\r\r\n IoCompletion : 4\r\r\n IRTimer : 6\r\r\n Key : 116\r\r\n Key : 7\r\r\n Process : 4\r\r\n Section : 1\r\r\n Semaphore : 2\r\r\n Thread : 77\r\r\n TpWorkerFactory : 3\r\r\n WaitCompletionPacket: 7\r\r\n WindowStation : 2\r\r\nTotal handles: 471\r\r\n"
Handle.exe version 4.11 was installed, as shown in the log.
The problem is that its output differs slightly from versions prior to 4.0, I suppose.

It starts with "Nthandle", whereas old versions start with "Handle", and the Rabbit agent looks for exactly "Handle" in the output in rabbit_mgmt_external_stats.erl.
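
A minimal sketch of a more tolerant banner check, assuming the output is a charlist; this is not the plugin's actual parsing code in rabbit_mgmt_external_stats.erl:

%% Hypothetical sketch: accept both the old "Handle" and the new "Nthandle"
%% banner when deciding whether the output came from Sysinternals Handle.
looks_like_handle_output(Output) ->
    Trimmed = string:trim(Output, leading),
    case Trimmed of
        "Nthandle" ++ _ -> true;
        "Handle" ++ _   -> true;
        _               -> false
    end.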

Sample retention policies init failure with 3.6.9

This error seems to be linked to the sample retention policies initialization:

Config:

{sample_retention_policies,
 [{global,   [{60, 5}, {3600, 60}]},
  {basic,    [{60, 20}, {3600, 360}]},
  {detailed, [{10, 5}]}]}
]},

Error in log:

"Error: {could_not_start,rabbitmq_management_agent,\n {{shutdown,\n {failed_to_start_child,rabbit_mgmt_agent_sup,\n {shutdown,\n {failed_to_start_child,\n channel_queue_exchange_metrics_metrics_collector,\n {function_clause,\n [{lists,min,[[]],[{file,"lists.erl"},{line,315}]},\n {rabbit_mgmt_metrics_collector,init,1,\n [{file,"src/rabbit_mgmt_metrics_collector.erl"},{line,73}]},\n {gen_server,init_it,6,[{file,"gen_server.erl"},{line,328}]},\n {proc_lib,init_p_do_apply,3,\n [{file,"proc_lib.erl"},{line,247}]}]}}}}},\n {rabbit_mgmt_agent_app,start,[normal,[]]}}}"

It used to work with 3.6.6.
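
The function_clause comes from lists:min/1 being called with an empty list, presumably because the collector found no retention intervals for its policy. A hedged sketch of a defensive lookup with a default; the lookup shape and the default value are assumptions, not the plugin's real behaviour:

%% Hypothetical illustration: avoid lists:min([]) when no retention
%% intervals are found for a policy by falling back to a default interval.
min_interval(PolicyName, RetentionPolicies) ->
    case proplists:get_value(PolicyName, RetentionPolicies, []) of
        []        -> 5;   %% assumed default, in seconds
        Intervals -> lists:min([I || {I, _Samples} <- Intervals])
    end.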

Periodic garbage collection of stats

Although most stats are properly cleaned up after queues/channels/... are deleted, there might be some corner cases missing. It is even possible that there is a race condition between consuming stats from core tables and deleting objects, as we have seen some very small leaks of channel_created_stats in some deployments.

On top of the current GC strategy, we should implement a periodic GC that verifies that each stat belongs to a living object in the system; otherwise, it should be deleted.
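
A minimal sketch of what such a periodic sweep could look like, assuming the stats live in an ETS table keyed by pid; the interval and names are illustrative, not the plugin's actual implementation:

-define(GC_INTERVAL, 120000).  %% 2 minutes, an assumed value

%% Illustrative sketch only: walk a stats table (passed in by name) and
%% delete rows whose owning pid is no longer alive, then reschedule the
%% next sweep by sending start_gc to the calling process.
periodic_gc(Table) ->
    ets:foldl(fun({Pid, _Stats}, ok) when is_pid(Pid) ->
                      case rabbit_misc:is_process_alive(Pid) of
                          true  -> ok;
                          false -> ets:delete(Table, Pid), ok
                      end;
                 (_Other, ok) ->
                      ok
              end, ok, Table),
    erlang:send_after(?GC_INTERVAL, self(), start_gc),
    ok.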

Queue metrics might not be reported on node restart and plugin enablement

See deadtrickster/prometheus_rabbitmq_exporter#45

Sometimes on node restart the queue message stats (ready, unack, ...) are reported as NaN in the UI and do not recover over time. This is a race condition in the boot step introduced by the agent to force stats collection when collect_statistics is set to none in rabbit.

In this case, enabling the rabbitmq_management_agent plugin sets the parameter to fine. However, it seems this can sometimes happen after the queues have been recovered, and then they never emit stats.

Forcing an event refresh solves both the race condition on node restart and the lack of stats when the plugin is enabled while collect_statistics is none, by forcing collection to start immediately.
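
Condensed from the ensure_statistics_enabled/0 snippet in the first issue, the intended ordering is to raise the stats level first and then force objects that already exist to re-emit their events:

%% Condensed from ensure_statistics_enabled/0 shown above: raise the stats
%% level, then force every queue/channel to emit a fresh event so the
%% collectors see objects created before the plugin was enabled.
application:set_env(rabbit, collect_statistics, fine),
ok = rabbit:force_event_refresh(erlang:make_ref()).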
