
rabbitmq-management-agent's Issues

Force Event Refresh causes lots of discarded messages

Hi,

When RabbitMQ is slow to start up, it may log many "Discarding message" errors.

2020-04-30 05:10:27.244 [error] emulator Discarding message {'$gen_cast',{force_event_refresh,#Ref<0.2714273141.3436707841.245624>}} from <0.344.0> to <0.438.0> in an old incarnation (2) of this node (3)

The cause of these messages is the following:

ensure_statistics_enabled() ->
    ForceStats = rates_mode() =/= none,
    handle_force_fine_statistics(),
    {ok, StatsLevel} = application:get_env(rabbit, collect_statistics),
    rabbit_log:info("Management plugin: using rates mode '~p'~n", [rates_mode()]),
    case {ForceStats, StatsLevel} of
        {true, fine} ->
            ok;
        {true, _} ->
            application:set_env(rabbit, collect_statistics, fine);
        {false, none} ->
            application:set_env(rabbit, collect_statistics, coarse);
        {_, _} ->
            ok
    end,
    ok = rabbit:force_event_refresh(erlang:make_ref()).
%%----------------------------------------------------------------------------

Calling into this:

https://github.com/rabbitmq/rabbitmq-server/blob/d9ac4fba3cf82552da9c8bbedc74cfb9f2cfafda/src/rabbit_amqqueue.erl#L1357-L1363

Basically, Mnesia still contains the old queue pids, which is what causes these messages.

Before restart these were the queues in the system:

[{amqqueue,{resource,<<"/">>,queue,<<"durable-queue">>},
           true,false,none,[],<0.438.0>,[],[],[],undefined,undefined,
           [],[],live,0,[],<<"/">>,
           #{user => <<"guest">>}},
 {amqqueue,{resource,<<"/">>,queue,<<"test_queue">>},
           true,false,none,[],<0.441.0>,[],[],[],undefined,undefined,
           [],[],live,0,[],<<"/">>,
           #{user => <<"my_user">>}}]

After restart we get these errors:

2020-04-30 05:10:27.244 [info] <0.344.0> Management plugin: using rates mode 'basic'
2020-04-30 05:10:27.244 [info] <0.344.0> Running boot step recovery defined by app rabbit
2020-04-30 05:10:27.244 [error] emulator Discarding message {'$gen_cast',{force_event_refresh,#Ref<0.2714273141.3436707841.245624>}} from <0.344.0> to <0.438.0> in an old incarnation (2) of this node (3)
2020-04-30 05:10:27.244 [error] emulator Discarding message {'$gen_cast',{force_event_refresh,#Ref<0.2714273141.3436707841.245624>}} from <0.344.0> to <0.441.0> in an old incarnation (2) of this node (3)

You can see that if this runs before the queues have been properly restarted on a single-node installation, it may generate as many of these messages as there are queues. I am not sure why, but in some cases we can see this in a clustered setup as well, presumably after a full cluster restart.

This can cause high memory usage; in the past we saw several thousand of these messages, which slowed down RabbitMQ's startup, and we have only now managed to track it down.

I am not sure whether this is a race condition between the management agent starting up and the queues starting, or some other error, but it is essentially 100% reproducible on a Windows installation with 2 cores and 2 queues, running in a VM on my machine.

I am not sure if there is a way to check whether a pid belongs to the current incarnation of the node, or whether simply checking that the pid is alive would be enough, but it would be good to avoid these messages.
I am also not sure whether it causes any actual problems that these messages are lost.
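
A minimal sketch of the kind of filtering that could avoid casting to stale pids, assuming a local single-node case; the helper name is hypothetical and this is not the actual rabbit_amqqueue implementation:

%% Hypothetical sketch only: skip pids that no longer refer to a live
%% process before casting force_event_refresh, so stale pids left in
%% Mnesia are ignored instead of producing "Discarding message" errors.
force_event_refresh(Ref, QPids) ->
    Live = [Pid || Pid <- QPids, is_alive_local_pid(Pid)],
    [gen_server2:cast(Pid, {force_event_refresh, Ref}) || Pid <- Live],
    ok.

%% Pids recovered from Mnesia can refer to an older incarnation of this
%% node; treat anything erlang:is_process_alive/1 rejects as not alive.
%% (This deliberately ignores the clustered case for brevity.)
is_alive_local_pid(Pid) when is_pid(Pid) ->
    try erlang:is_process_alive(Pid)
    catch error:badarg -> false
    end;
is_alive_local_pid(_) ->
    false.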

Thanks

"Could not find handle.exe, please install from sysinternals" message floods the log

It appears every ~5 seconds, which works out to ~17,000 entries and ~1.7 MB per day, and is pretty useless. It also makes analyzing the logs harder.

It is not always possible to install Sysinternals tools on a client's machine, and, as you know, they cannot be distributed with RabbitMQ (https://docs.microsoft.com/en-us/sysinternals/license-faq).

Would it be possible to emit this error only once, on the first occurrence, and require a restart of RabbitMQ after Sysinternals is installed?

* Sometimes, for no apparent reason, we get "Could not infer the number of file handles used: badarg" instead.
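
To illustrate the "emit only once" idea from the question above, here is a minimal sketch using persistent_term as a one-shot flag; the key and helper name are made up for this example and are not part of the plugin:

%% Hypothetical sketch of one-shot logging; handle_exe_warned is an ad-hoc
%% key, not part of the real plugin code.
warn_handle_exe_missing_once() ->
    case persistent_term:get({?MODULE, handle_exe_warned}, false) of
        false ->
            persistent_term:put({?MODULE, handle_exe_warned}, true),
            rabbit_log:warning(
              "Could not find handle.exe, please install from Sysinternals "
              "(this warning is only logged once)~n", []);
        true ->
            ok
    end.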

badarith in rabbit_mgmt_metrics_collector:sum_entry/2

I have the following set up:

  • RabbitMQ cluster version 3.6.9 with 3 nodes in AWS EC2

  • The main client of the cluster is Sensu, with vhost '/sensu', user 'sensu', and HA policy '^(results$|keepalives$)' '{"ha-mode":"all", "ha-sync-mode":"automatic"}'

So, Sensu is able to use this cluster for its intended purpose. However, I do see the following error in the SASL log every now and then:

=CRASH REPORT==== 28-Apr-2017::19:34:47 ===
  crasher:
    initial call: rabbit_mgmt_metrics_collector:init/1
    pid: <0.27508.15>
    registered_name: connection_coarse_metrics_metrics_collector
    exception exit: {badarith,
                        [{rabbit_mgmt_metrics_collector,sum_entry,2,
                             [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                              {line,550}]},
                         {rabbit_mgmt_metrics_collector,
                             '-insert_entry_op/4-fun-0-',2,
                             [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                              {line,529}]},
                         {dict,update_bkt,4,[{file,"dict.erl"},{line,323}]},
                         {dict,on_bucket,3,[{file,"dict.erl"},{line,415}]},
                         {dict,update,4,[{file,"dict.erl"},{line,318}]},
                         {rabbit_mgmt_metrics_collector,insert_entry_op,4,
                             [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                              {line,528}]},
                         {lists,foldl,3,[{file,"lists.erl"},{line,1248}]},
                         {rabbit_mgmt_metrics_collector,aggregate_entry,4,
                             [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                              {line,213}]}]}
      in function  gen_server:terminate/6 (gen_server.erl, line 744)
    ancestors: [rabbit_mgmt_agent_sup,rabbit_mgmt_agent_sup_sup,<0.362.0>]
    messages: []
    links: [<0.364.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 2586
    stack_size: 27
    reductions: 1848
  neighbours:

=SUPERVISOR REPORT==== 28-Apr-2017::19:34:47 ===
     Supervisor: {local,rabbit_mgmt_agent_sup}
     Context:    child_terminated
     Reason:     {badarith,
                     [{rabbit_mgmt_metrics_collector,sum_entry,2,
                          [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                           {line,550}]},
                      {rabbit_mgmt_metrics_collector,
                          '-insert_entry_op/4-fun-0-',2,
                          [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                           {line,529}]},
                      {dict,update_bkt,4,[{file,"dict.erl"},{line,323}]},
                      {dict,on_bucket,3,[{file,"dict.erl"},{line,415}]},
                      {dict,update,4,[{file,"dict.erl"},{line,318}]},
                      {rabbit_mgmt_metrics_collector,insert_entry_op,4,
                          [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                           {line,528}]},
                      {lists,foldl,3,[{file,"lists.erl"},{line,1248}]},
                      {rabbit_mgmt_metrics_collector,aggregate_entry,4,
                          [{file,"src/rabbit_mgmt_metrics_collector.erl"},
                           {line,213}]}]}
     Offender:   [{pid,<0.27508.15>},
                  {name,connection_coarse_metrics_metrics_collector},
                  {mfargs,
                      {rabbit_mgmt_metrics_collector,start_link,
                          [connection_coarse_metrics]}},
                  {restart_type,permanent},
                  {shutdown,30000},
                  {child_type,worker}]

I'm no expert in Erlang, but based on this piece of information it looks like it has something to do with this line:
https://github.com/rabbitmq/rabbitmq-management-agent/blob/master/src/rabbit_mgmt_metrics_collector.erl#L550

My guess is that the caller somehow passed an invalid (non-numeric) value to the sum_entry function. While this doesn't seem to cause any issues for our RabbitMQ cluster, it is somewhat misleading to see so many errors like this in the log file.
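
For illustration, a rough sketch of what an element-wise sum_entry-style helper does and how a non-numeric slot would trigger badarith; this is not the plugin's actual code:

%% Rough illustration of an element-wise counter sum; not the plugin's real
%% sum_entry/2. If either tuple carries a non-numeric slot (for example the
%% atom 'undefined'), the '+' below raises badarith, matching the crash above.
sum_entry(New, Old) when tuple_size(New) =:= tuple_size(Old) ->
    list_to_tuple(
      lists:zipwith(fun(A, B) -> A + B end,
                    tuple_to_list(New), tuple_to_list(Old))).

%% Example of the failure mode:
%% sum_entry({1, 2}, {3, undefined}).  %% => error:badarith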

Let me know if you need any information.

Reset old_aggr_stats

The reset of the stats db should also reset the old_aggr_stats stored in the collector processes.
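
As a hedged sketch, a reset message handled by each collector process could clear both its ETS table and its old_aggr_stats; the #state{} fields below are assumptions, not the plugin's real record:

%% Hypothetical sketch only: the record fields are assumed for this example.
-record(state, {table, old_aggr_stats = dict:new()}).

handle_cast(reset, State = #state{table = Table}) ->
    ets:delete_all_objects(Table),
    {noreply, State#state{old_aggr_stats = dict:new()}};
handle_cast(_Msg, State) ->
    {noreply, State}.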

Metric GC process terminates with badarg in 3.6.9

Management plugin config:

{rabbitmq_management,
  [{listener, [{port,     15672},
                {ssl,      false},
                {ssl_opts, [{cacertfile, "/etc/rabbitmq/ssl/cacert.pem"},
                            {certfile,   "/etc/rabbitmq/ssl/server.pem"},
                            {keyfile,    "/etc/rabbitmq/ssl/server.key"}]}]},
   {rates_mode, none},
   {sample_retention_policies,
    [{global,   [{60, 5}]},
     {basic,    [{60, 20}]},
     {detailed, [{10, 5}]}]}
  ]},

Log error:

=ERROR REPORT==== 30-Mar-2017::00:24:14 ===
** Generic server rabbit_mgmt_gc terminating
** Last message in was start_gc
** When Server state == {state,#Ref<0.0.1.100845>,120000}
** Reason for termination ==
** {badarg,[{erlang,node,[none],[]},
            {rabbit_misc,is_process_alive,1,
                         [{file,"src/rabbit_misc.erl"},{line,872}]},
            {rabbit_mgmt_gc,gc_process,3,
                            [{file,"src/rabbit_mgmt_gc.erl"},{line,136}]},
            {lists,foldl,3,[{file,"lists.erl"},{line,1263}]},
            {ets,do_foldl,4,[{file,"ets.erl"},{line,585}]},
            {ets,foldl,3,[{file,"ets.erl"},{line,574}]},
            {rabbit_mgmt_gc,handle_info,2,
                            [{file,"src/rabbit_mgmt_gc.erl"},{line,44}]},
            {gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,601}]}]}
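
The badarg comes from erlang:node/1 being handed the atom none instead of a pid inside rabbit_misc:is_process_alive/1. A minimal sketch of a defensive guard, with hypothetical function names:

%% Illustrative sketch only (function names are hypothetical): guard against
%% non-pid keys such as the atom 'none' before the liveness check, so that
%% erlang:node/1 inside rabbit_misc:is_process_alive/1 never receives a
%% non-pid and raises badarg.
maybe_mark_for_gc(Pid, DeadAcc) when is_pid(Pid) ->
    case rabbit_misc:is_process_alive(Pid) of
        true  -> DeadAcc;
        false -> [Pid | DeadAcc]   %% collect dead pids for later removal
    end;
maybe_mark_for_gc(_NotAPid, DeadAcc) ->
    DeadAcc.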

api "/api/queues/vhost/name" 404

When I try to get an individual queue via the API "/api/queues/vhost/name", I get a 404.
When I try to delete an individual queue via the same API, I get a 405.

The output of new versions of Handle.exe has been changed

Running RabbitMQ version 3.7.8 on Windows

Got this message in log:
2018-09-24 01:24:21.086 [warning] <0.826.0> Could not parse handle.exe output: "\r\nNthandle v4.11 - Handle viewer\r\nCopyright (C) 1997-2017 Mark Russinovich\r\nSysinternals - www.sysinternals.com\r\n\r\nHandle type summary:\r\r\n ALPC Port : 1\r\r\n Desktop : 1\r\r\n Directory : 2\r\r\n EtwRegistration : 57\r\r\n Event : 181\r\r\n IoCompletion : 4\r\r\n IRTimer : 6\r\r\n Key : 116\r\r\n Key : 7\r\r\n Process : 4\r\r\n Section : 1\r\r\n Semaphore : 2\r\r\n Thread : 77\r\r\n TpWorkerFactory : 3\r\r\n WaitCompletionPacket: 7\r\r\n WindowStation : 2\r\r\nTotal handles: 471\r\r\n"
Handle.exe version 4.11 was installed, as shown in the log.
The problem is that its output differs slightly from versions prior to 4.0, I suppose.

It starts with "Nthandle", whereas old versions start with "Handle", and the Rabbit agent looks for exactly "Handle" in the output in rabbit_mgmt_external_stats.erl.
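
A minimal sketch of a more tolerant banner check, assuming the output is a charlist; this is not the plugin's actual parsing code in rabbit_mgmt_external_stats.erl:

%% Hypothetical sketch: accept both the old "Handle" and the new "Nthandle"
%% banner when deciding whether the output came from Sysinternals Handle.
looks_like_handle_output(Output) ->
    Trimmed = string:trim(Output, leading),
    case Trimmed of
        "Nthandle" ++ _ -> true;
        "Handle" ++ _   -> true;
        _               -> false
    end.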

Sample retention policies init failure with 3.6.9

This error seems to be linked to the sample retention policies initialization:

Config:

{sample_retention_policies,
 [{global,   [{60, 5}, {3600, 60}]},
  {basic,    [{60, 20}, {3600, 360}]},
  {detailed, [{10, 5}]}]}
]},

Error in log:

"Error: {could_not_start,rabbitmq_management_agent,\n {{shutdown,\n {failed_to_start_child,rabbit_mgmt_agent_sup,\n {shutdown,\n {failed_to_start_child,\n channel_queue_exchange_metrics_metrics_collector,\n {function_clause,\n [{lists,min,[[]],[{file,"lists.erl"},{line,315}]},\n {rabbit_mgmt_metrics_collector,init,1,\n [{file,"src/rabbit_mgmt_metrics_collector.erl"},{line,73}]},\n {gen_server,init_it,6,[{file,"gen_server.erl"},{line,328}]},\n {proc_lib,init_p_do_apply,3,\n [{file,"proc_lib.erl"},{line,247}]}]}}}}},\n {rabbit_mgmt_agent_app,start,[normal,[]]}}}"

It used to work with 3.6.6.
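
The function_clause comes from lists:min/1 being called with an empty list, presumably because the collector found no retention intervals for its policy. A hedged sketch of a defensive lookup with a default; the lookup shape and the default value are assumptions, not the plugin's real behaviour:

%% Hypothetical illustration: avoid lists:min([]) when no retention
%% intervals are found for a policy by falling back to a default interval.
min_interval(PolicyName, RetentionPolicies) ->
    case proplists:get_value(PolicyName, RetentionPolicies, []) of
        []        -> 5;   %% assumed default, in seconds
        Intervals -> lists:min([I || {I, _Samples} <- Intervals])
    end.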

Periodic garbage collection of stats

Although most stats are properly cleaned up after queues/channels/... are deleted, there might be some corner cases missing. It is even possible that there is a race condition between consuming stats from core tables and deleting objects, as we have seen some very small leaks of channel_created_stats in some deployments.

On top of the current GC strategy, we should implement a periodic GC that verifies that each stat belongs to a living object in the system; otherwise, it should be deleted.
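
A minimal sketch of what such a periodic sweep could look like, assuming the stats live in an ETS table keyed by pid; the interval and names are illustrative, not the plugin's actual implementation:

-define(GC_INTERVAL, 120000).  %% 2 minutes, an assumed value

%% Illustrative sketch only: walk a stats table (passed in by name) and
%% delete rows whose owning pid is no longer alive, then reschedule the
%% next sweep by sending start_gc to the calling process.
periodic_gc(Table) ->
    ets:foldl(fun({Pid, _Stats}, ok) when is_pid(Pid) ->
                      case rabbit_misc:is_process_alive(Pid) of
                          true  -> ok;
                          false -> ets:delete(Table, Pid), ok
                      end;
                 (_Other, ok) ->
                      ok
              end, ok, Table),
    erlang:send_after(?GC_INTERVAL, self(), start_gc),
    ok.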

Queue metrics might not be reported on node restart and plugin enablement

See deadtrickster/prometheus_rabbitmq_exporter#45

Sometimes on node restart the queue message stats (ready, unack, ...) are reported as NaN in the UI and do not recover over time. This is a race condition in the boot step introduced by the agent to force stats collection when collect_statistics is set to none in rabbit.

In this case, enabling the rabbitmq_management_agent plugin sets the parameter to fine. However, it seems this can sometimes happen after the queues have been recovered, and then they never emit stats.

Forcing an event refresh solves both the race condition on node restart and the lack of stats when the plugin is enabled while collect_statistics is none, by forcing collection to start immediately.
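
Condensed from the ensure_statistics_enabled/0 snippet in the first issue, the intended ordering is to raise the stats level first and then force objects that already exist to re-emit their events:

%% Condensed from ensure_statistics_enabled/0 shown above: raise the stats
%% level, then force every queue/channel to emit a fresh event so the
%% collectors see objects created before the plugin was enabled.
application:set_env(rabbit, collect_statistics, fine),
ok = rabbit:force_event_refresh(erlang:make_ref()).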
