
RabbitMQ Autocluster

What it Does

This plugin provides a mechanism for peer node discovery in RabbitMQ clusters. It also supports a few opinionated features around cluster formation and "permanently unavailable" node detection.

Note for RabbitMQ 3.7.x Users

Starting with RabbitMQ 3.7.0 this plugin was superseded by a new peer discovery subsystem built on the same ideas and supporting the same backends via separate plugins.

This plugin therefore is deprecated and should not be used by those running RabbitMQ 3.7.0 or a later version.

Supported Discovery Backends

Nodes using this plugin will discover their peers on boot and (optionally) register with one of the supported backends: AWS (EC2 tags or Autoscaling Group membership), Consul, DNS (round-robin A records), etcd, or Kubernetes.

If at least one peer node has been discovered, cluster formation proceeds as usual, otherwise the node is considered to be the first one to come up and becomes the seed node.

To avoid a natural race condition around seed node "election" when a newly formed cluster first boots, peer discovery backends use either randomized delays or a locking mechanism.

Some backends support node health checks. Nodes not reporting their status periodically are considered to be in an errored state. If the user opts in, such nodes can be automatically removed from the cluster. This is useful for deployments that use AWS autoscaling groups or similar IaaS features, for example.

This plugin only covers cluster formation and does not change how RabbitMQ clusters operate once formed.

Note: This plugin is not a replacement for first-hand knowledge of how to manually create a RabbitMQ cluster. If you run into issues using the plugin, try to manually create the cluster in the same environment in which you intend to use the plugin. For information on how to cluster RabbitMQ manually, please see the RabbitMQ documentation.

Current Maintainers

This plugin was originally developed by Gavin Roy at AWeber and is now co-maintained by several RabbitMQ core contributors. Parts of it were adopted into RabbitMQ core (as of 3.7.0).

Supported RabbitMQ Versions

There are three branches in this repository that target different RabbitMQ release series:

  • v3.6.x targets RabbitMQ 3.6.x (current stable RabbitMQ branch)
  • v3.7.x is compatible with RabbitMQ 3.7.x but this plugin was superseded by a new peer discovery subsystem built on the same ideas.
  • master is a development branch that's not of much use at the moment.

Please take this into account when building this plugin from source.

Please also note that key ideas of this plugin have been incorporated into the RabbitMQ master branch and will be included in 3.7.0. This plugin will therefore become a collection of backends (e.g. AWS and etcd) rather than a wholesale alternative cluster formation implementation.

Supported Erlang Versions

This plugin requires Erlang/OTP 18.3 or later. Also see the RabbitMQ Erlang version requirements guide.

Binary Releases

Binary releases of autocluster can be found on the GitHub Releases page.

The most recent release is 0.10.0, which targets RabbitMQ 3.6.12 or later.

See release notes for details.

Installation

This plugin is installed the same way as other RabbitMQ plugins.

  1. Place both autocluster-{version}.ez and the rabbitmq_aws-{version}.ez plugin files in the RabbitMQ plugins directory.
  2. Enable the plugin, e.g. with rabbitmq-plugins enable autocluster --offline.
  3. Configure the plugin.
  4. Start the node.
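
For example, on a typical Linux installation the first two steps might look like this (a sketch only; the plugins directory path varies between installations and packagings):

# copy both plugin archives into the node's plugins directory, then enable the plugin
cp autocluster-*.ez rabbitmq_aws-*.ez /usr/lib/rabbitmq/plugins/
rabbitmq-plugins enable autocluster --offline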

Alternatively, there is a pre-built Docker image available on Docker Hub as pivotalrabbitmq/rabbitmq-autocluster.

Note that the plugin does not have a default backend configured. A little bit of configuration is therefore mandatory regardless of the backend used.

Configuration

General settings

Configuration for the plugin can be set in two places: operating system environment variables or the rabbitmq.config file under the autocluster section.

Available Settings

The following settings are generic and used by most (or all) service discovery backends:

Backend Type
Which type of service discovery backend to use. One of aws, consul, dns, etcd or k8s.
Startup Delay
To prevent a race condition when creating a new cluster for the first time, the startup delay makes each node sleep for a random amount of time so that nodes start at slightly different offsets from each other. This setting controls the maximum value of that delay.
Failure Mode
What behavior to use when the node fails to cluster with an existing RabbitMQ cluster or during initialization of the autocluster plugin. The two valid options are ignore and stop.
Log Level
You can set the log level via the environment variable AUTOCLUSTER_LOG_LEVEL or the autocluster.autocluster_log_level key (see below).
Longname (FQDN) Support
This is a RabbitMQ environment variable setting that is used by the autocluster plugin as well. When set to true this will cause RabbitMQ and the autocluster plugin to use fully qualified names to identify nodes. For more information about the RABBITMQ_USE_LONGNAME environment variable, see the RabbitMQ documentation
Node Name
Like long node name support, the node name is a RabbitMQ server setting that can be used together with this plugin. The RABBITMQ_NODENAME environment variable explicitly sets the node name that is used to identify the node with RabbitMQ. The autocluster plugin will use this value when constructing the local part/name/prefix for all nodes in this cluster. For example, if RABBITMQ_NODENAME is set to bunny@rabbit1, bunny will be prefixed to all nodes discovered by the various backends. For more information about the RABBITMQ_NODENAME environment variable, see the RabbitMQ documentation. Note that some backends offer ways to dynamically compute the node name (e.g. AWS, Consul), while others assume that node names are preconfigured out-of-band and provided by the discovery service (e.g. DNS). In those cases it may or may not be possible (or recommended) to use RABBITMQ_NODENAME.
Node Type
Define the type of node to join the cluster as. One of disc or ram. See the RabbitMQ Clustering Guide for more information.
Cluster Cleanup
Enables a periodic check that removes any nodes that are not alive in the cluster and no longer listed by the service discovery backend. This is a destructive action that removes nodes from the cluster. Nodes that are flapping and get removed will be re-added as if they were new, and their database, including any persisted messages, will be lost. To use this feature, you must not only enable it with this flag but also disable the "Cleanup Warn Only" flag. Added in v0.5

Note: This is an experimental feature and should be used with caution.

Cleanup Interval
If cluster cleanup is enabled, this is the interval that specifies how often to look for dead nodes to remove (in seconds). Added in v0.5
Cleanup Warn Only
If set, the plugin will only warn about nodes that it would cleanup and will not perform any destructive actions on the cluster. Added in v0.5
HTTP Proxy
If set, the given HTTP URL will be used as a proxy to connect to the service discovery backend.
HTTPS Proxy
If set, the given HTTPS URL will be used as a proxy to connect to the service discovery backend.
Proxy Exclusions
List of host names which shouldn't use any proxy.
When using environment variables, the exclusion list must be provided as a comma-separated string: PROXY_EXCLUSIONS="localhost, 127.0.0.1"

How to Configure Settings

The autocluster plugin can be configured via environment variables or in the rabbitmq.config file.

Note: RabbitMQ reads environment variables from its own configuration file, rabbitmq-env.conf, but you can't easily reuse it for autocluster configuration. If you absolutely want to do so, use export VAR_NAME=var_value instead of a plain assignment to VAR_NAME.
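
For example, a minimal sketch of a rabbitmq-env.conf that configures the Consul backend this way (the file path and host value are placeholders):

# /etc/rabbitmq/rabbitmq-env.conf
# plain VAR=value assignments are not visible to the plugin; use export
export AUTOCLUSTER_TYPE=consul
export CONSUL_HOST=consul.example.local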

The following chart details each general setting, with the environment variable name, rabbitmq.config setting key and data type, and the default value if there is one.

Setting            Environment Variable   Setting Key            Type     Default
Backend Type       AUTOCLUSTER_TYPE       backend                atom     unconfigured
Startup Delay      AUTOCLUSTER_DELAY      startup_delay          integer  5
Failure Mode       AUTOCLUSTER_FAILURE    autocluster_failure    atom     ignore
Log Level          AUTOCLUSTER_LOG_LEVEL  autocluster_log_level  atom     info
Longname           RABBITMQ_USE_LONGNAME  N/A                    bool     false
Node Name          RABBITMQ_NODENAME      N/A                    string   rabbit@$HOSTNAME
Node Type          RABBITMQ_NODE_TYPE     node_type              atom     disc
Cluster Cleanup    AUTOCLUSTER_CLEANUP    cluster_cleanup        bool     false
Cleanup Interval   CLEANUP_INTERVAL       cleanup_interval       integer  60
Cleanup Warn Only  CLEANUP_WARN_ONLY      cleanup_warn_only      bool     true
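
Putting a few of the general settings together, the following is a minimal rabbitmq.config sketch (the values are illustrative; backend-specific settings are covered in their own sections below):

[
  {autocluster, [
    %% which peer discovery backend to use
    {backend,             consul},
    %% maximum randomized startup delay, in seconds
    {startup_delay,       10},
    %% stop the node if clustering fails instead of continuing standalone
    {autocluster_failure, stop},
    %% join the cluster as a disc node
    {node_type,           disc},
    %% opt into (destructive) removal of dead nodes
    {cluster_cleanup,     true},
    {cleanup_interval,    60},
    {cleanup_warn_only,   false}
  ]}
].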

Logging Configuration

To configure logging level used by this plugin, use the AUTOCLUSTER_LOG_LEVEL environment variable or autocluster.autocluster_log_level setting.

Here's a very minimalistic example that enables debug logging:

[
  {autocluster, [
    {autocluster_log_level, debug}
  ]}
].

Valid log levels are debug, info, warning, and error. For more information on RabbitMQ configuration please refer to RabbitMQ documentation.

AWS Configuration

The AWS backend for the autocluster plugin supports two node discovery mechanisms: Autoscaling Group membership and EC2 tags.

The following settings impact the behavior of the AWS backend. See the AWS API Credentials section below for additional settings.

Autoscaling
Cluster based upon membership in an Autoscaling Group. Set to true to enable.
EC2 Tags
Filter the cluster node list with the specified tags. Use a comma delimiter for multiple tags when specifying as an environment variable.
Use private IP
Use the private IP address returned by autoscaling as the hostname, instead of the private DNS name.

NOTE: If this is your first time setting up RabbitMQ with autoscaling-based clustering and you are doing so for R&D purposes, you may want to check out the gavinmroy/alpine-rabbitmq-autocluster Docker image repository for a working example of the plugin using a CloudFormation template that creates everything required for an Autoscaling Group based cluster.

Details

Environment Variable Setting Key Type Default
AWS_AUTOSCALING aws_autoscaling atom false
AWS_EC2_TAGS aws_ec2_tags [string()]
AWS_USE_PRIVATE_IP aws_use_private_ip atom false

Notes

If aws_autoscaling is enabled, the EC2 backend will dynamically determine the autoscaling group that the node is a member of and attempt to join the other nodes in the autoscaling group.

If aws_autoscaling is disabled, you must specify EC2 tags to use to filter the nodes that the backend should cluster with.

AWS API Configuration and Credentials

As with the AWS CLI, the autocluster plugin configures the AWS API requests by attempting to resolve the values in a number of steps.

The configuration values are discovered in the following order:

  1. Explicitly configured in the autocluster configuration.
  2. Environment variables
  3. Configuration file
  4. EC2 Instance Metadata Service (for Region)

The credentials values are discovered in the following order:

  1. Explicitly configured in the autocluster configuration.
  2. Environment variables
  3. Credentials file
  4. EC2 Instance Metadata Service
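
As an illustration of step 3 in both lists, the plugin follows the same file conventions as the AWS CLI; the default paths can be overridden with AWS_SHARED_CREDENTIALS_FILE and AWS_CONFIG_FILE (the key values below are the same placeholders used later in this document):

# ~/.aws/credentials
[default]
aws_access_key_id = AKIDEXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG+bPxRfiCYEXAMPLEKEY

# ~/.aws/config
[default]
region = us-west-2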

AWS Credentials and Configuration Settings

The following settings and environment variables impact the configuration and credentials behavior. For more information see the Amazon AWS CLI documentation.

Environment Variable Setting Key Type Default
AWS_ACCESS_KEY_ID aws_access_key string
AWS_SECRET_ACCESS_KEY aws_secret_key string
AWS_DEFAULT_REGION aws_ec2_region string us-east-1
AWS_DEFAULT_PROFILE N/A string
AWS_CONFIG_FILE N/A string
AWS_SHARED_CREDENTIALS_FILE N/A string

IAM Policy

If you intend to use the EC2 Instance Metadata Service along with an IAM Role that is assigned to EC2 instances, you will need a policy that allows the plugin to discover the node list. The following is an example of such a policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingInstances",
        "ec2:DescribeInstances"
      ],
      "Resource": ["*"]
    }
  ]
}

Example Configuration

The following configuration example enables the autoscaling based cluster discovery and sets the EC2 region to us-west-2:

[
  {autocluster, [
    {autocluster_log_level, debug},
    {backend, aws},
    {aws_autoscaling, true},
    {aws_ec2_region, "us-west-2"}
  ]}
].

For non-autoscaling group based clusters, the following configuration demonstrates how to limit EC2 instances in the cluster to nodes with the tags region=us-west-2 and service=rabbitmq. It also specifies the AWS access key and AWS secret key.

[
  {autocluster, [
    {autocluster_log_level, debug},
    {backend, aws},
    {aws_ec2_tags, [
      {"region", "us-west-2"},
      {"service", "rabbitmq"}
    ]},
    {aws_ec2_region, "us-east-1"},
    {aws_access_key, "AKIDEXAMPLE"},
    {aws_secret_key, "wJalrXUtnFEMI/K7MDENG+bPxRfiCYEXAMPLEKEY"}
  ]}
].

When using environment variables, the tags must be provided in JSON format:

AWS_EC2_TAGS="{\"region\": \"us-west-2\",\"service\": \"rabbitmq\"}"

Example Cloud-Init

The following is an example cloud-init that was tested with Ubuntu Trusty for use with an Autoscaling Group:

#cloud-config
apt_update: true
apt_upgrade: true
apt_sources:
  - source: deb https://apt.dockerproject.org/repo ubuntu-trusty main
    keyid: 58118E89F3A912897C070ADBF76221572C52609D
    filename: docker.list
packages:
  - docker-engine
runcmd:
  - docker run -d --name rabbitmq --net=host -p 4369:4369 -p 5672:5672 -p 15672:15672 -p 25672:25672 gavinmroy/rabbitmq-autocluster

Consul configuration

The following settings impact the configuration of the Consul backend for the autocluster plugin:

Consul Scheme
The URI scheme to use when connecting to Consul
Consul Host
The hostname to use when connecting to Consul's API
Consul Port
The port to use when connecting to Consul's API
Consul ACL Token
The Consul access token to use when registering the node with Consul (optional)
Service Name
The name of the service to register with Consul for automatic clustering
Service Address
An IP address or host name to use when registering the service. If this is specified, the value will automatically be appended to the service ID. This is useful when you are testing with a single Consul server instead of having an agent for every RabbitMQ node. (optional)
Service Auto Address
Use the hostname of the current machine (retrieved with `gethostname(2)`) for the service address when registering the service with Consul. If this is enabled, the hostname will automatically be appended to the service ID. This is useful when you are testing with a single Consul server instead of having an agent for every RabbitMQ node. (optional)
Service Auto Address by NIC
Use the IP address of the specified network interface controller (NIC) as the service address when registering with Consul. (optional)
Service Port
Used to set a port for the service in Consul, allowing for the automatic clustering service registration to double as a general RabbitMQ service registration.

Note: Set the CONSUL_SVC_PORT to an empty value to disable port announcement and health checking. For example: CONSUL_SVC_PORT=""

Consul Use Longname
When the node names registered with Consul are not fully qualified (FQDN) addresses, this option makes it possible to append a .node.<Consul domain> suffix (see Consul Domain below) to the node names retrieved from Consul.
Consul Domain
The domain suffix appended to peer node hostname when long node names are used (see above).
Service TTL
Used to specify the Consul health check interval that is used to let Consul know that RabbitMQ is alive and healthy.
Service Tags
Used to specify the Consul service tags. If a cluster name is specified, the tags specified here are added to the cluster name tag
Service unregistration timeout
How soon should Consul unregister a node that is failing its health check? The value is in seconds and cannot be lower than 60.
Include nodes that fail Consul health checks?
If set to `true`, nodes that fail their health checks with Consul will still be included into discovery results.

Configuration Details

Setting Environment Variable Setting Key Type Default
Consul Scheme CONSUL_SCHEME consul_scheme string http
Consul Host CONSUL_HOST consul_host string localhost
Consul Port CONSUL_PORT consul_port integer 8500
Consul ACL Token CONSUL_ACL_TOKEN consul_acl_token string
Service Name CONSUL_SVC consul_svc string rabbitmq
Service Address CONSUL_SVC_ADDR consul_svc_addr string
Service Auto Address CONSUL_SVC_ADDR_AUTO consul_svc_addr_auto boolean false
Service Auto Address by NIC CONSUL_SVC_ADDR_NIC consul_svc_addr_nic string
Service Port CONSUL_SVC_PORT consul_svc_port integer 5672
Service TTL CONSUL_SVC_TTL consul_svc_ttl integer 30
Service Tags CONSUL_SVC_TAGS consul_svc_tags list []
Service unregistration timeout CONSUL_DEREGISTER_AFTER consul_deregister_after integer 60
Consul Use Longname CONSUL_USE_LONGNAME consul_use_longname boolean false
Consul Domain CONSUL_DOMAIN consul_domain string consul
Include nodes that fail Consul health checks? CONSUL_INCLUDE_NODES_WITH_WARNINGS consul_include_nodes_with_warnings boolean false

Example rabbitmq.config

An example that configures an ACL token and contacts a local Consul agent:

[
  {rabbit,      []},
  {autocluster, [
            {backend, consul},
            {consul_host, "localhost"},
            {consul_port, 8500},
            {consul_acl_token, "example-acl-token"},
            {consul_svc, "rabbitmq-test"},
            {cluster_name, "test"}
  ]}
].
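
Roughly the same configuration expressed through environment variables (cluster_name is only shown as a configuration key in this document, so it is omitted here):

export AUTOCLUSTER_TYPE=consul
export CONSUL_HOST=localhost
export CONSUL_PORT=8500
export CONSUL_ACL_TOKEN=example-acl-token
export CONSUL_SVC=rabbitmq-test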

The following example can be used for a cluster of N nodes, one running on a development machine (my-laptop.local) and N - 1 running in VMs or containers with access to host networking.

Node names will be [email protected], [email protected], and [email protected].

[
  {rabbit,      []},
  {autocluster, [
            {backend, consul},
            {consul_host, "my-laptop.local"},
            {consul_port, 8500},
            {consul_use_longname, true},
            {consul_svc, "rabbitmq"},
            {consul_svc_addr_auto, true},
            {consul_svc_addr_nodename, true}
  ]}
].

In the following example, the service address reported to Consul is hardcoded to hostname1.messaging.dev.local instead of being computed automatically from the environment:

[
  {rabbit,      []},
  {autocluster, [
            {backend, consul},
            {consul_host, "my-laptop.local"},
            {consul_port, 8500},
            {consul_use_longname, true},
            {consul_svc, "rabbitmq"},
            {consul_svc_addr_auto, false},
            {consul_svc_addr, "hostname1.messaging.dev.local"}
  ]}
].

Example Docker Compose File

The repository also includes an example Docker Compose file that demonstrates how to create a dynamic RabbitMQ cluster; a rough sketch of such a setup is shown below.
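
The Compose file itself is not reproduced here. The following is only a rough sketch of what such a setup could look like, assuming the pivotalrabbitmq/rabbitmq-autocluster image, the Consul backend, and the environment variables documented above (the Erlang cookie value is a placeholder; all cluster members must share it):

# docker-compose.yml (sketch only; names and values are illustrative)
version: "2"

services:
  consul:
    image: consul
    ports:
      - "8500:8500"

  rabbit1:
    image: pivotalrabbitmq/rabbitmq-autocluster
    depends_on:
      - consul
    environment:
      RABBITMQ_ERLANG_COOKIE: "placeholder-cookie"
      AUTOCLUSTER_TYPE: "consul"
      CONSUL_HOST: "consul"
      CONSUL_PORT: "8500"
      CONSUL_SVC_ADDR_AUTO: "true"

  rabbit2:
    image: pivotalrabbitmq/rabbitmq-autocluster
    depends_on:
      - consul
    environment:
      RABBITMQ_ERLANG_COOKIE: "placeholder-cookie"
      AUTOCLUSTER_TYPE: "consul"
      CONSUL_HOST: "consul"
      CONSUL_PORT: "8500"
      CONSUL_SVC_ADDR_AUTO: "true"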

DNS configuration

The following setting applies only to the DNS backend:

DNS Hostname

The FQDN to use when the backend type is dns for looking up the RabbitMQ nodes to cluster via a DNS A record round-robin.

Environment Variable AUTOCLUSTER_HOST
Setting Key autocluster_host
Data type string
Default Value consul

Example Configuration

The following configuration example enables the DNS based cluster discovery and sets the autocluster_host variable to your DNS Round-Robin A record:

[
  {autocluster, [
    {backend, dns},
    {autocluster_host, "YOUR_ROUND_ROBIN_A_RECORD"}
  ]}
].
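
The equivalent configuration using environment variables:

export AUTOCLUSTER_TYPE=dns
export AUTOCLUSTER_HOST=YOUR_ROUND_ROBIN_A_RECORD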

Troubleshooting

If you are having issues getting your RabbitMQ cluster formed, please check that Erlang can resolve:

  • The DNS Round-Robin A Record. Imagine having 3 nodes with IPs 10.0.0.2, 10.0.0.3 and 10.0.0.4:
> inet_res:lookup("YOUR_ROUND_ROBIN_A_RECORD", in, a).
[{10,0,0,2},{10,0,0,3},{10,0,0,4}]
  • All the nodes have reverse lookup entries in your DNS server. You should get something similar to this:
> inet_res:gethostbyaddr({10,0,0,2}).
{ok,{hostent,"YOUR_REVERSE_LOOKUP_ENTRY",[],
inet,4,
[{10,0,0,2}]}}
  • Erlang will always receive lowercase DNS names, so be careful if you use your /etc/hosts file to resolve the other nodes in the cluster: if you use uppercase names there, RabbitMQ will get confused and the cluster will not form (see the example below).
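
For example, a hypothetical /etc/hosts fragment for the three nodes above, using lowercase names only (the host names are placeholders):

# /etc/hosts -- keep host names lowercase so they match what Erlang receives
10.0.0.2  rabbit-node-1.example.local
10.0.0.3  rabbit-node-2.example.local
10.0.0.4  rabbit-node-3.example.local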

etcd configuration

The following settings apply to the etcd backend only:

etcd Scheme
The URI scheme to use when connecting to etcd
etcd Host
The hostname to use when connecting to etcd's API
etcd Port
The port to use when connecting to etcd's API
etcd Key Prefix
The prefix used when storing cluster membership keys in etcd
etcd Node TTL
Used to specify how long a node can be down before it is removed from etcd's list of RabbitMQ nodes in the cluster

Setting Environment Variable Setting Key Type Default
etcd Scheme ETCD_SCHEME etcd_scheme list http
etcd Host ETCD_HOST etcd_host list localhost
etcd Port ETCD_PORT etcd_port int 2379
etcd Key Prefix ETCD_PREFIX etcd_prefix list rabbitmq
etcd Node TTL ETCD_TTL etcd_ttl integer 30

NOTE The etcd backend supports etcd v2 and v3.
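
As an illustration, a minimal rabbitmq.config sketch for the etcd backend using the keys from the table above (the etcd host is a placeholder; value types follow the table):

[
  {autocluster, [
    {backend,     etcd},
    {etcd_scheme, "http"},
    {etcd_host,   "etcd.example.local"},
    {etcd_port,   2379},
    {etcd_prefix, "rabbitmq"},
    {etcd_ttl,    30}
  ]}
].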

K8S configuration

The following settings impact the configuration of the Kubernetes backend for the autocluster plugin:

K8S Scheme
The URI scheme to use when connecting to the Kubernetes API server
K8S Host
The hostname of the Kubernetes API server
K8S Port
The port to use when connecting to the Kubernetes API server
K8S Token Path
The token path of the Pod's service account
K8S Cert Path
The path of the service account certificate used to authenticate with the Kubernetes API server
K8S Namespace Path
The path of the service account namespace file
K8S Service Name
The RabbitMQ service name in Kubernetes
K8S Address Type
The address type, either ip or hostname
K8S Hostname Suffix
The suffix to append to the hostname

Setting Environment Variable Setting Key Type Default
K8S Scheme K8S_SCHEME k8s_scheme string https
K8S Host K8S_HOST k8s_host string kubernetes.default.svc.cluster.local
K8S Port K8S_PORT k8s_port integer 443
K8S Token Path K8S_TOKEN_PATH k8s_token_path string /var/run/secrets/kubernetes.io/serviceaccount/token
K8S Cert Path K8S_CERT_PATH k8s_cert_path string /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
K8S Namespace Path K8S_NAMESPACE_PATH k8s_namespace_path string /var/run/secrets/kubernetes.io/serviceaccount/namespace
K8S Service Name K8S_SERVICE_NAME k8s_service_name string rabbitmq
K8S Address Type K8S_ADDRESS_TYPE k8s_address_type string ip
K8S Hostname Suffix K8S_HOSTNAME_SUFFIX k8s_hostname_suffix string

Kubernetes Setup

In order for this plugin to work, your nodes need to use fully qualified names (FQDNs), i.e. set RABBITMQ_USE_LONGNAME=true in your pod definition.
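
A minimal rabbitmq.config sketch for the Kubernetes backend follows; the hostname suffix is a placeholder and assumes a headless service named rabbitmq in the default namespace:

[
  {autocluster, [
    {backend,             k8s},
    {k8s_service_name,    "rabbitmq"},
    %% use pod hostnames rather than IP addresses for node names
    {k8s_address_type,    "hostname"},
    %% placeholder suffix so that the resulting node names are FQDNs
    {k8s_hostname_suffix, ".rabbitmq.default.svc.cluster.local"}
  ]}
].

As noted above, RABBITMQ_USE_LONGNAME=true must also be set in the pod definition so that the resulting node names are treated as fully qualified.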

Development

WIP Notes for dev environment

Requirements

  • erlang 17.5
  • docker-machine
  • docker-compose
  • make

Setup

Start docker-machine:

docker-machine create --driver virtualbox default
eval $(docker-machine env)

Start client containers:

docker-compose up -d

Development environment

Work in Progress

Make Commands

  • tests
  • run-broker
  • shell
  • dist
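
For instance, assuming these targets behave like the standard RabbitMQ plugin build targets:

make tests       # run the test suite
make run-broker  # start a local RabbitMQ node with the plugin enabled
make dist        # build the distributable .ez plugin archives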

Docker

Building the container:

docker build -t rabbitmq-autocluster .

Testing Consul behaviors

Here's the base pattern for how I test against Consul when developing:

make dist
docker build -t rabbitmq-autocluster .

docker network create rabbitmq_network

docker run --rm -t -i --net=rabbitmq_network --name=consul -p 8500:8500 consul

docker run --rm -t -i --net=rabbitmq_network --name=node0 -e AUTOCLUSTER_TYPE=consul -e CONSUL_HOST=consul -e CONSUL_PORT=8500 -e CONSUL_SERVICE_TTL=60  -e AUTOCLUSTER_CLEANUP=true -e CLEANUP_WARN_ONLY=false -e CONSUL_SVC_ADDR_AUTO=true -p 15672:15672 rabbitmq-autocluster

docker run --rm -t -i --net=rabbitmq_network --name=node1 -e RABBITMQ_NODE_TYPE=ram -e AUTOCLUSTER_TYPE=consul -e CONSUL_HOST=consul -e CONSUL_PORT=8500 -e CONSUL_SERVICE_TTL=60  -e AUTOCLUSTER_CLEANUP=true -e CLEANUP_WARN_ONLY=false -e CONSUL_SVC_ADDR_AUTO=true rabbitmq-autocluster

docker run --rm -t -i --net=rabbitmq_network --name=node2 -e RABBITMQ_NODE_TYPE=ram -e AUTOCLUSTER_TYPE=consul -e CONSUL_HOST=consul -e CONSUL_PORT=8500 -e CONSUL_SERVICE_TTL=60  -e AUTOCLUSTER_CLEANUP=true -e CLEANUP_WARN_ONLY=false -e CONSUL_SVC_ADDR_AUTO=true rabbitmq-autocluster

- Consul management: http://localhost:8500/ui
- RabbitMQ cluster: http://localhost:15672/

License

BSD 3-Clause

rabbitmq-autocluster's People

Contributors

a1dutch, akissa, alanprot, alexeyraga, avvs, binarin, cap10morgan, dcorbacho, dumbbell, elifa, gigablah, gmr, jcarr-sailthru, luxflux, michaelklishin, noxdafox, repl-andrew-ovens


rabbitmq-autocluster's Issues

[enhancement] add cleanup_failures for pruning dead nodes

I have enabled cluster_cleanup/cleanup_interval on my system for removing dead RabbitMQ nodes if they are failing health checks in a Consul cluster (backend=consul)

One issue I have experienced is that the cleanup occurs immediately, as soon as the check fails. This turns it into a kind of Russian roulette for brief outages, where cleanup_interval doesn't matter if a 5-second outage happens to occur at the moment the check runs.

Suggest addition of cleanup_failures for the minimum number of consecutive failures before the node is pruned. While this won't remove the possibility of short outages causing early pruning (there could always be two short outages that happen right at consecutive checks) this does reduce the possibility of them happening.

Examples:

  • cleanup_interval=60,cleanup_failures=0 - check every 60sec, drop immediately (0=current behavior)
  • cleanup_interval=60,cleanup_failures=1 - check every 60sec, drop if failed two checks (60s+ outage)
  • cleanup_interval=60,cleanup_failures=2 - check every 60sec, drop if failed three checks (120s+ outage)
  • cleanup_interval=600,cleanup_failures=2 - check every 10min, drop if failed three checks (20min+ outage)

FYI for me: I am not running this in a production environment. I have increased cleanup_interval to 600 to reduce the chance of this occurring (and my needing to manually recycle nodes). I am aware I have the option to disable experimental features.

rabbitmq-autocluster failing

Have been at this for a few days:

/ # rabbitmqctl join_cluster [email protected]
Clustering node '[email protected]' with '[email protected]'
Error: {inconsistent_cluster,"Node '[email protected]' thinks it's clustered with node '[email protected]', but '[email protected]' disagrees"}

Whatever I do, I cannot get the second node to join properly, nor have it join as a "disc" node. It continually joins as a ram node, even when performing the join manually.

Here is the environment:

/ # env | grep RABBITMQ
RABBITMQ_USE_LONGNAME=true
RABBITMQ_DISK_FREE_LIMIT="8GiB"
RABBITMQ_PORT_15672_TCP=tcp://10.110.213.110:15672
RABBITMQ_PORT_25672_TCP=tcp://10.110.213.110:25672
RABBITMQ_LOGS=-
RABBITMQ_MANAGER_PORT_NUMBER=15672
[email protected]
RABBITMQ_SERVICE_PORT_HTTP=15672
RABBITMQ_PLUGINS_EXPAND_DIR=/var/lib/rabbitmq/plugins
RABBITMQ_PASSWORD=abc123
RABBITMQ_VERSION=3.6.14
RABBITMQ_PLUGINS_DIR=/usr/lib/rabbitmq/plugins
RABBITMQ_SERVICE_HOST=10.110.213.110
RABBITMQ_SASL_LOGS=-
RABBITMQ_NODE_TYPE=stats
RABBITMQ_BASE=/rabbitmq
RABBITMQ_PORT_5672_TCP_ADDR=10.110.213.110
RABBITMQ_PORT_4369_TCP_ADDR=10.110.213.110
RABBITMQ_SERVICE_PORT_EPMD=4369
RABBITMQ_SERVICE_PORT=15672
RABBITMQ_PORT=tcp://10.110.213.110:15672
RABBITMQ_PORT_5672_TCP_PORT=5672
RABBITMQ_PORT_5672_TCP_PROTO=tcp
RABBITMQ_VHOST=/
RABBITMQ_PORT_4369_TCP_PORT=4369
RABBITMQ_PORT_4369_TCP_PROTO=tcp
RABBITMQ_NODE_PORT_NUMBER=5672
RABBITMQ_PID_FILE=/var/lib/rabbitmq/rabbitmq.pid
RABBITMQ_SERVER_ERL_ARGS=+K true +A128 +P 1048576 -kernel inet_default_connect_options [{nodelay,true}]
RABBITMQ_PORT_15672_TCP_ADDR=10.110.213.110
RABBITMQ_SERVICE_PORT_AMQP=5672
RABBITMQ_PORT_25672_TCP_ADDR=10.110.213.110
RABBITMQ_MNESIA_DIR=/var/lib/rabbitmq/mnesia
RABBITMQ_PORT_5672_TCP=tcp://10.110.213.110:5672
RABBITMQ_PORT_15672_TCP_PORT=15672
RABBITMQ_USERNAME=user
RABBITMQ_HOME=/rabbitmq
RABBITMQ_PORT_4369_TCP=tcp://10.110.213.110:4369
RABBITMQ_PORT_15672_TCP_PROTO=tcp
RABBITMQ_PORT_25672_TCP_PORT=25672
RABBITMQ_PORT_25672_TCP_PROTO=tcp
RABBITMQ_DIST_PORT=25672
RABBITMQ_SERVICE_PORT_DIST=25672

Mirroring

Hi,
This plugin appears to work only with the pivotal image and not the base rabbitmq 3.6.x image.
I also need to enable mirroring , but the pods ends up "PostHookError".
This is what I have in the kube spec YML file , can you please help me on this :-
( the yml formatting is all messed up with this editor, but otherwise it is well-formed )
[....]
lifecycle:
  postStart:
    exec:
      command:
        - /bin/sh
        - -c
        - >
          sleep 1m && rabbitmqctl set_policy queue-mirror-ha ".*" '{"ha-mode":"all","ha-sync-mode": "automatic"}' --apply-to queues
[....]

Unable to configure using environment variables

I am not able to configure the RMQ default user name, password and vhost name using environment variables. The container starts with the default user name and password (guest).

This is the command I used to run the container:
docker run -d --name rabbitmq --net=host -p 4369:4369 -p 5672:5672 -p 15672:15672 -p 25672:25672 -e RABBITMQ_DEFAULT_PASS=<MyPassWord> -e RABBITMQ_DEFAULT_USER=<MyUserName> -e RABBITMQ_DEFAULT_VHOST=<MyVHost> -e AUTOCLUSTER_TYPE=aws -e AWS_AUTOSCALING=true -e AUTOCLUSTER_CLEANUP=true -e CLEANUP_WARN_ONLY=false -e AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION pivotalrabbitmq/rabbitmq-autocluster

Can someone help me if I'm missing something?

plugin not detected on ubuntu 18.04, rabbitmq 3.6.10

Hi,

I'm running Ubuntu 18.04 with RabbitMQ 3.6.10.

I have tried copying the *.ez files to /usr/lib/rabbitmq/plugins (this folder does not exist on a default apt install). I have also tried copying the *.ez files to the /usr/lib/rabbitmq/plugins/rabbitmq_server-3.6.10/plugins folder.

In both instances, when I run rabbitmq-plugins list the plugins don't show up in the list, which means that when I run the rabbitmq-plugins enable command it fails.

Is there something I'm missing? I have tried v0.10.0 and v0.8.0 of the plugins.

autocluster: Step maybe_cluster failed with failure: inconsistent_cluster

While I was doing load testing against a 3-node RabbitMQ cluster in a Kubernetes environment, one RabbitMQ pod (10.244.1.40) got restarted and failed to rejoin the cluster.

Below are the logs it reported, which complain that "Node '[email protected]' thinks it's clustered with node '[email protected]', but '[email protected]' disagrees".

=INFO REPORT==== 15-Nov-2017::15:16:55 ===
autocluster: Running step find_best_node_to_join
=INFO REPORT==== 15-Nov-2017::15:16:55 ===
autocluster: GET https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/wxgigo/endpoints/rabbitmq
=INFO REPORT==== 15-Nov-2017::15:16:55 ===
autocluster: Response: [{ok,{{"HTTP/1.1",200,"OK"},
                             [{"date","Wed, 15 Nov 2017 15:16:55 GMT"},
                              {"content-length","1024"},
                              {"content-type","application/json"}],
                             "{\"kind\":\"Endpoints\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"rabbitmq\",\"namespace\":\"wxgigo\",\"selfLink\":\"/api/v1/namespaces/wxgigo/endpoints/rabbitmq\",\"uid\":\"211b434a-ca14-11e7-8c24-080027aacdc9\",\"resourceVersion\":\"196162\",\"creationTimestamp\":\"2017-11-15T14:49:06Z\",\"labels\":{\"app\":\"rabbitmq\"}},\"subsets\":[{\"addresses\":[{\"ip\":\"10.244.2.37\",\"hostname\":\"rabbitmq-0\",\"nodeName\":\"kube-node1\",\"targetRef\":{\"kind\":\"Pod\",\"namespace\":\"wxgigo\",\"name\":\"rabbitmq-0\",\"uid\":\"21508476-ca14-11e7-8c24-080027aacdc9\",\"resourceVersion\":\"193957\"}},{\"ip\":\"10.244.2.38\",\"hostname\":\"rabbitmq-2\",\"nodeName\":\"kube-node1\",\"targetRef\":{\"kind\":\"Pod\",\"namespace\":\"wxgigo\",\"name\":\"rabbitmq-2\",\"uid\":\"3a7c7d39-ca14-11e7-8c24-080027aacdc9\",\"resourceVersion\":\"194047\"}}],\"notReadyAddresses\":[{\"ip\":\"10.244.1.40\",\"hostname\":\"rabbitmq-1\",\"nodeName\":\"kube-node2\",\"targetRef\":{\"kind\":\"Pod\",\"namespace\":\"wxgigo\",\"name\":\"rabbitmq-1\",\"uid\":\"2dd9f567-ca14-11e7-8c24-080027aacdc9\",\"resourceVersion\":\"196160\"}}],\"ports\":[{\"name\":\"amqp\",\"port\":5672,\"protocol\":\"TCP\"}]}]}\n"}}]
=INFO REPORT==== 15-Nov-2017::15:16:55 ===
autocluster: k8s endpoint listing returned nodes not yet ready: 10.244.1.40
=INFO REPORT==== 15-Nov-2017::15:16:55 ===
autocluster: List of registered nodes retrieved from the backend: ['[email protected]',
                                                                   '[email protected]']

=INFO REPORT==== 15-Nov-2017::15:16:55 ===
autocluster: Fetching node details. Unreachable nodes (or nodes that responded with an error): []
=INFO REPORT==== 15-Nov-2017::15:16:55 ===
autocluster: Fetching node details. Responses: [{candidate_seed_node,
                                                 '[email protected]',1620526,
                                                 true,
                                                 ['[email protected]',
                                                  '[email protected]',
                                                  '[email protected]'],
                                                 ['[email protected]',
                                                  '[email protected]'],
                                                 [],[]},
                                                {candidate_seed_node,
                                                 '[email protected]',1663139,
                                                 true,
                                                 ['[email protected]',
                                                  '[email protected]',
                                                  '[email protected]'],
                                                 ['[email protected]',
                                                  '[email protected]'],
                                                 [],[]}]

=INFO REPORT==== 15-Nov-2017::15:16:55 ===
autocluster: Asked to choose preferred node from the list of: [{candidate_seed_node,
                                                                '[email protected]',
                                                                1620526,true,
                                                                ['[email protected]',
                                                                 '[email protected]',
                                                                 '[email protected]'],
                                                                ['[email protected]',
                                                                 '[email protected]'],
                                                                [],[]},
                                                               {candidate_seed_node,
                                                                '[email protected]',
                                                                1663139,true,
                                                                ['[email protected]',
                                                                 '[email protected]',
                                                                 '[email protected]'],
                                                                ['[email protected]',
                                                                 '[email protected]'],
                                                                [],[]}]
=INFO REPORT==== 15-Nov-2017::15:16:55 ===
autocluster: Filtered node list (does not include us and non-running/reachable nodes): [{candidate_seed_node,
                                                                                         '[email protected]',
                                                                                         1663139,
                                                                                         true,
                                                                                         ['[email protected]',
                                                                                          '[email protected]',
                                                                                          '[email protected]'],
                                                                                         ['[email protected]',
                                                                                          '[email protected]'],
                                                                                         [],
                                                                                         []},
                                                                                        {candidate_seed_node,
                                                                                         '[email protected]',
                                                                                         1620526,
                                                                                         true,
                                                                                         ['[email protected]',
                                                                                          '[email protected]',
                                                                                          '[email protected]'],
                                                                                         ['[email protected]',
                                                                                          '[email protected]'],
                                                                                         [],
                                                                                         []}]

=INFO REPORT==== 15-Nov-2017::15:16:55 ===
autocluster: Picked node as the preferred choice for joining: '[email protected]'
=INFO REPORT==== 15-Nov-2017::15:16:55 ===
autocluster: Running step maybe_cluster
=ERROR REPORT==== 15-Nov-2017::15:16:55 ===
Node '[email protected]' thinks it's clustered with node '[email protected]', but '[email protected]' disagrees
=ERROR REPORT==== 15-Nov-2017::15:16:55 ===
autocluster: Step maybe_cluster failed, will conitnue nevertheless. Failure reason: Failed to cluster with [email protected]: {inconsistent_cluster,[78,111,100,101,32,39,114,97,98,98,105,116,64,49,48,46,50,52,52,46,50,46,51,55,39,32,116,104,105,110,107,115,32,105,116,39,115,32,99,108,117,115,116,101,114,101,100,32,119,105,116,104,32,110,111,100,101,32,39,114,97,98,98,105,116,64,49,48,46,50,52,52,46,49,46,52,48,39,44,32,98,117,116,32,39,114,97,98,98,105,116,64,49,48,46,50,52,52,46,49,46,52,48,39,32,100,105,115,97,103,114,101,101,115]}.

Later, I clustered 10.244.1.40 with 10.244.2.37 manually and it succeeded. So what could be the possible reason? Is it possible the RabbitMQ cluster had still not kicked the bad node out of its cluster info (via cleanup) when the node tried to rejoin? Would reducing the value of CLEANUP_INTERVAL help?

AWS: Use PTR records for discovered instances

The AWS plugin returns the privateDNS field or IP address as the hostname for discovered instances in the cluster. PrivateDNS always has the ip-xx format. It would be nice to have a way to use our own hostnames for instances discovered with the AWS plugin. The best way I see is a reverse (PTR) lookup on the private IP instead of using the privateDNS fields.

[error] <0.272.0> Failed to fetch a list of nodes from Kubernetes API: 404

Describe the bug:
I've used the configuration in minikube, and I have this problem:

2021-02-18 08:14:52.827 [error] <0.272.0> Failed to fetch a list of nodes from Kubernetes API: 404
2021-02-18 08:14:52.831 [error] <0.272.0> Peer discovery returned an error: "404". Will retry after a delay of 500 ms, 9 retries left...
2021-02-18 08:14:53.337 [error] <0.272.0> Failed to fetch a list of nodes from Kubernetes API: 404
2021-02-18 08:14:53.340 [error] <0.272.0> Peer discovery returned an error: "404". Will retry after a delay of 500 ms, 8 retries left...
2021-02-18 08:14:53.847 [error] <0.272.0> Failed to fetch a list of nodes from Kubernetes API: 404
2021-02-18 08:14:53.851 [error] <0.272.0> Peer discovery returned an error: "404". Will retry after a delay of 500 ms, 7 retries left...
2021-02-18 08:14:54.357 [error] <0.272.0> Failed to fetch a list of nodes from Kubernetes API: 404
2021-02-18 08:14:54.360 [error] <0.272.0> Peer discovery returned an error: "404". Will retry after a delay of 500 ms, 6 retries left...
2021-02-18 08:14:54.867 [error] <0.272.0> Failed to fetch a list of nodes from Kubernetes API: 404
2021-02-18 08:14:54.870 [error] <0.272.0> Peer discovery returned an error: "404". Will retry after a delay of 500 ms, 5 retries left...
2021-02-18 08:14:55.375 [error] <0.272.0> Failed to fetch a list of nodes from Kubernetes API: 404
2021-02-18 08:14:55.378 [error] <0.272.0> Peer discovery returned an error: "404". Will retry after a delay of 500 ms, 4 retries left...
2021-02-18 08:14:55.885 [error] <0.272.0> Failed to fetch a list of nodes from Kubernetes API: 404
2021-02-18 08:14:55.888 [error] <0.272.0> Peer discovery returned an error: "404". Will retry after a delay of 500 ms, 3 retries left...
2021-02-18 08:14:56.395 [error] <0.272.0> Failed to fetch a list of nodes from Kubernetes API: 404
2021-02-18 08:14:56.398 [error] <0.272.0> Peer discovery returned an error: "404". Will retry after a delay of 500 ms, 2 retries left...
2021-02-18 08:14:56.905 [error] <0.272.0> Failed to fetch a list of nodes from Kubernetes API: 404
2021-02-18 08:14:56.907 [error] <0.272.0> Peer discovery returned an error: "404". Will retry after a delay of 500 ms, 1 retries left...
2021-02-18 08:14:57.414 [error] <0.272.0> Failed to fetch a list of nodes from Kubernetes API: 404
2021-02-18 08:14:57.416 [error] <0.272.0> Peer discovery returned an error: "404". Will retry after a delay of 500 ms, 0 retries left...

BOOT FAILED
===========
Exception during startup:

    rabbit_boot_steps:run_boot_steps/1 line 20
    rabbit_boot_steps:'-run_boot_steps/1-lc$^0/1-0-'/1 line 19
    rabbit_boot_steps:run_step/2 line 46
    rabbit_boot_steps:'-run_step/2-lc$^0/1-0-'/2 line 41
    rabbit_mnesia:init/0 line 76
    rabbit_mnesia:init_with_lock/3 line 111
    rabbit_mnesia:run_peer_discovery_with_retries/2 line 145
    rabbit_mnesia:run_peer_discovery_with_retries/2 line 138
error:{badmatch,ok}

2021-02-18 08:14:57.920 [info] <0.44.0> Application mnesia exited with reason: stopped
2021-02-18 08:14:57.921 [error] <0.272.0> 
2021-02-18 08:14:57.921 [info] <0.44.0> Application mnesia exited with reason: stopped
2021-02-18 08:14:57.921 [error] <0.272.0> BOOT FAILED
2021-02-18 08:14:57.921 [error] <0.272.0> ===========
2021-02-18 08:14:57.921 [error] <0.272.0> Exception during startup:
2021-02-18 08:14:57.922 [error] <0.272.0> 
2021-02-18 08:14:57.922 [error] <0.272.0>     rabbit_boot_steps:run_boot_steps/1 line 20
2021-02-18 08:14:57.922 [error] <0.272.0>     rabbit_boot_steps:'-run_boot_steps/1-lc$^0/1-0-'/1 line 19
2021-02-18 08:14:57.922 [error] <0.272.0>     rabbit_boot_steps:run_step/2 line 46
2021-02-18 08:14:57.922 [error] <0.272.0>     rabbit_boot_steps:'-run_step/2-lc$^0/1-0-'/2 line 41
2021-02-18 08:14:57.923 [error] <0.272.0>     rabbit_mnesia:init/0 line 76
2021-02-18 08:14:57.923 [error] <0.272.0>     rabbit_mnesia:init_with_lock/3 line 111
2021-02-18 08:14:57.923 [error] <0.272.0>     rabbit_mnesia:run_peer_discovery_with_retries/2 line 145
2021-02-18 08:14:57.923 [error] <0.272.0>     rabbit_mnesia:run_peer_discovery_with_retries/2 line 138
2021-02-18 08:14:57.923 [error] <0.272.0> error:{badmatch,ok}
2021-02-18 08:14:57.923 [error] <0.272.0> 
2021-02-18 08:14:58.925 [info] <0.271.0> [{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.271.0>},{registered_name,[]},{error_info,{exit,{{badmatch,ok},{rabbit,start,[normal,[]]}},[{application_master,init,4,[{file,"application_master.erl"},{line,138}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}},{ancestors,[<0.270.0>]},{message_queue_len,1},{messages,[{'EXIT',<0.272.0>,normal}]},{links,[<0.270.0>,<0.44.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,376},{stack_size,28},{reductions,354}], []
2021-02-18 08:14:58.925 [error] <0.271.0> CRASH REPORT Process <0.271.0> with 0 neighbours exited with reason: {{badmatch,ok},{rabbit,start,[normal,[]]}} in application_master:init/4 line 138
2021-02-18 08:14:58.926 [info] <0.44.0> Application rabbit exited with reason: {{badmatch,ok},{rabbit,start,[normal,[]]}}
2021-02-18 08:14:58.926 [info] <0.44.0> Application rabbit exited with reason: {{badmatch,ok},{rabbit,start,[normal,[]]}}
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{{badmatch,ok},{rabbit,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{{badmatch,ok},{rabbit,start,[normal,[]]}}})

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

How is RABBITMQ_NODENAME value used by rabbitmq-autocluster?

Problem description

I have started two rabbitmq docker containers on two different AWS EC2 instances in the sa-east-1 region, using AWS ECS to manage them. I've built the image and I'm pretty sure that this plugin was correctly installed (this can be confirmed by the logs below).

I need to use RABBITMQ_NODENAME to set the RabbitMQ node name to a defined hostname because I can't use a docker container with host networking.

For testing, you can start a rabbitmq 3.6.x container with rabbitmq-autocluster plugin version 0.10.0 using the following run command on two EC2 instances tagged with env=socialbase:

docker run -ti --name rmq --hostname rabbitmq \
    -e AUTOCLUSTER_CLEANUP=true \
    -e AUTOCLUSTER_DELAY=30 \
    -e AUTOCLUSTER_LOG_LEVEL=debug \
    -e AUTOCLUSTER_TYPE=aws \
    -e AWS_DEFAULT_REGION=sa-east-1 \
    -e AWS_EC2_TAGS='{"env": "socialbase"}' \
    -e CLEANUP_WARN_ONLY=false \
    -e RABBITMQ_ERLANG_COOKIE=xxx \
    -e RABBITMQ_NODENAME=rabbit@rabbitmq \
  rabbitmq-image-name

Here are the details of my env:

Environment variables

/ # printenv
AUTOCLUSTER_LOG_LEVEL=debug
HOSTNAME=rabbitmq
AWS_EC2_TAGS={"env": "socialbase"}
SHLVL=2
HOME=/var/lib/rabbitmq
RABBITMQ_LOGS=-
RABBITMQ_NODENAME=rabbit@rabbitmq
RABBITMQ_ERLANG_COOKIE=xxx
AUTOCLUSTER_TYPE=aws
S6_FILE=s6-overlay-amd64.tar.gz
AWS_DEFAULT_REGION=sa-east-1
RABBITMQ_GPG_KEY=0A9AF2115F4687BD29803A206B73A36E6026DFCA
RABBITMQ_VERSION=3.6.12
TERM=xterm
AUTOCLUSTER_CLEANUP=true
AUTOCLUSTER_DELAY=30
RABBITMQ_SASL_LOGS=-
PATH=/opt/rabbitmq/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
CONFD_FILE=confd-0.11.0-linux-amd64.gz
BIN_PATH=/usr/local/bin
PWD=/
RABBITMQ_HOME=/opt/rabbitmq
CLEANUP_WARN_ONLY=false
DOCKERIZE_FILE=dockerize-linux-amd64-v0.2.0.tar.gz
RABBITMQ_GITHUB_TAG=rabbitmq_v3_6_12

RabbitMQ/Erlang Version

/ # rabbitmqctl status
Status of node rabbit@rabbitmq
[{pid,2222},
 {running_applications,
     [{autocluster,
          "Forms RabbitMQ clusters using a variety of backends (AWS EC2, DNS, Consul, Kubernetes, etc)",
          "0.9.0+4.g0e7899d"},
      {rabbitmq_management,"RabbitMQ Management Console","3.6.12"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.12"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.12"},
      {rabbitmq_delayed_message_exchange,"RabbitMQ Delayed Message Exchange",
          "0.0.1"},
      {rabbit,"RabbitMQ","3.6.12"},
      {mnesia,"MNESIA  CXC 138 12","4.13.4"},
      {amqp_client,"RabbitMQ AMQP Client","3.6.12"},
      {rabbit_common,
          "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
          "3.6.12"},
      {cowboy,"Small, fast, modular HTTP server.","1.0.4"},
      {xmerl,"XML parser","1.3.10"},
      {os_mon,"CPO  CXC 138 46","2.4"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.3.0"},
      {rabbitmq_aws,
          "A minimalistic AWS API interface used by rabbitmq-autocluster (3.6.x) and other RabbitMQ plugins",
          "3.6.13.milestone1+2.g946e794"},
      {ssl,"Erlang/OTP SSL application","7.3.1"},
      {public_key,"Public key infrastructure","1.1.1"},
      {asn1,"The Erlang ASN1 compiler version 4.0.2","4.0.2"},
      {compiler,"ERTS  CXC 138 10","6.0.3"},
      {cowlib,"Support library for manipulating Web protocols.","1.0.2"},
      {crypto,"CRYPTO","3.6.3"},
      {syntax_tools,"Syntax tools","1.7"},
      {inets,"INETS  CXC 138 49","6.2.2"},
      {sasl,"SASL  CXC 138 11","2.7"},
      {stdlib,"ERTS  CXC 138 10","2.8"},
      {kernel,"ERTS  CXC 138 10","4.2"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 18 [erts-7.3.1] [source] [64-bit] [smp:4:4] [async-threads:64] [kernel-poll:true]\n"},
 {memory,
     [{connection_readers,0},
      {connection_writers,0},
      {connection_channels,0},
      {connection_other,2712},
      {queue_procs,2712},
      {queue_slave_procs,0},
      {plugins,1895896},
      {other_proc,23448304},
      {metrics,194360},
      {mgmt_db,142288},
      {mnesia,68456},
      {other_ets,2247272},
      {binary,123032},
      {msg_index,40864},
      {code,28026822},
      {atom,1033377},
      {other_system,14351505},
      {total,71577600}]},
 {alarms,[]},
 {listeners,[{clustering,25672,"::"},{amqp,5672,"::"},{http,15672,"::"}]},
 {vm_memory_calculation_strategy,rss},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,429496729},
 {disk_free_limit,50000000},
 {disk_free,6930915328},
 {file_descriptors,
     [{total_limit,924},{total_used,2},{sockets_limit,829},{sockets_used,0}]},
 {processes,[{limit,1048576},{used,334}]},
 {run_queue,0},
 {uptime,187},
 {kernel,{net_ticktime,60}}]

RabbitMQ server and client application log files

=INFO REPORT==== 31-Oct-2017::10:21:16 ===
Starting RabbitMQ 3.6.12 on Erlang 18.3.2
Copyright (C) 2007-2017 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

              RabbitMQ 3.6.12. Copyright (C) 2007-2017 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: tty
  ######  ##        tty
  ##########
              Starting broker...

=INFO REPORT==== 31-Oct-2017::10:21:16 ===
node           : rabbit@rabbitmq
home dir       : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.config
cookie hash    : Q8goAtoH0kfw25pdUQxb2Q==
log            : tty
sasl log       : tty
database dir   : /var/lib/rabbitmq/mnesia/rabbit@rabbitmq

=INFO REPORT==== 31-Oct-2017::10:21:18 ===
autocluster: log level set to debug

=INFO REPORT==== 31-Oct-2017::10:21:18 ===
autocluster: Running discover/join step

=INFO REPORT==== 31-Oct-2017::10:21:18 ===
    application: mnesia
    exited: stopped
    type: temporary

=INFO REPORT==== 31-Oct-2017::10:21:18 ===
autocluster: Apps 'rabbit' and 'mnesia' successfully stopped

=INFO REPORT==== 31-Oct-2017::10:21:18 ===
autocluster: Running step initialize_backend

=INFO REPORT==== 31-Oct-2017::10:21:18 ===
autocluster: Using AWS backend

=INFO REPORT==== 31-Oct-2017::10:21:18 ===
autocluster: Starting dependencies of backend aws: [rabbitmq_aws]

=INFO REPORT==== 31-Oct-2017::10:21:18 ===
autocluster: Running step acquire_startup_lock

=INFO REPORT==== 31-Oct-2017::10:21:18 ===
autocluster: Delaying startup for 27737ms.

=INFO REPORT==== 31-Oct-2017::10:21:45 ===
autocluster: Running step find_best_node_to_join

=INFO REPORT==== 31-Oct-2017::10:21:45 ===
autocluster: Setting region: "sa-east-1"

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: AWS request: /?Action=DescribeInstances&Filter.1.Name=tag%3Aenv&Filter.1.Value.1=socialbase&Version=2015-10-01
Response: [{"DescribeInstancesResponse",
            [{"requestId","f9c2ee3f-4292-4515-a1c5-735021b18161"},
             {"reservationSet",
              [{"item",
                [{"reservationId","r-0c26f5305400263de"},
                 {"ownerId","832266673134"},
                 {"groupSet",[]},
                 {"instancesSet",
                  [{"item",
                    [{"instanceId","i-01b5692949213d3c9"},
                     {"imageId","ami-ae0971c2"},
                     {"instanceState",[{"code","16"},{"name","running"}]},
                     {"privateDnsName",
                      "ip-10-0-3-20.sa-east-1.compute.internal"},
                     {"dnsName",
                      "ec2-54-233-190-88.sa-east-1.compute.amazonaws.com"},
                     {"reason",[]},
                     {"keyName","mysshkey"},
                     {"amiLaunchIndex","0"},
                     {"productCodes",[]},
                     {"instanceType","c3.xlarge"},
                     {"launchTime","2017-10-31T09:11:59.000Z"},
                     {"placement",
                      [{"availabilityZone","sa-east-1c"},
                       {"groupName",[]},
                       {"tenancy","default"}]},
                     {"monitoring",[{"state","disabled"}]},
                     {"subnetId","subnet-b90123ff"},
                     {"vpcId","vpc-656bd800"},
                     {"privateIpAddress","10.0.3.20"},
                     {"ipAddress","54.233.190.88"},
                     {"sourceDestCheck","true"},
                     {"groupSet",
                      [{"item",
                        [{"groupId","sg-88d9a5ef"},{"groupName","SG_ECS"}]}]},
                     {"architecture","x86_64"},
                     {"rootDeviceType","ebs"},
                     {"rootDeviceName","/dev/xvda"},
                     {"blockDeviceMapping",
                      [{"item",
                        [{"deviceName","/dev/xvda"},
                         {"ebs",
                          [{"volumeId","vol-05afac15ca8df17b4"},
                           {"status","attached"},
                           {"attachTime","2017-10-31T09:12:00.000Z"},
                           {"deleteOnTermination","true"}]}]},
                       {"item",
                        [{"deviceName","/dev/xvdcz"},
                         {"ebs",
                          [{"volumeId","vol-0be274acfd288e5b6"},
                           {"status","attached"},
                           {"attachTime","2017-10-31T09:12:00.000Z"},
                           {"deleteOnTermination","true"}]}]}]},
                     {"instanceLifecycle","spot"},
                     {"spotInstanceRequestId","sir-2m4rdafm"},
                     {"virtualizationType","hvm"},
                     {"clientToken","e308a29e-27fc-4d9d-8f78-c4dd35f38d2d"},
                     {"tagSet",
                      [{"item",
                        [{"key","weave:peerGroupName"},
                         {"value","socialbase"}]},
                       {"item",
                        [{"key","aws:ec2spot:fleet-request-id"},
                         {"value",
                          "sfr-25ac2bb7-6c67-42b5-b2b0-26683c7f6496"}]},
                       {"item",[{"key","ssh_user"},{"value","ec2-user"}]},
                       {"item",[{"key","ssh_port"},{"value","22"}]},
                       {"item",[{"key","Name"},{"value","ecs"}]},
                       {"item",
                        [{"key","ssh_key"},{"value","mysshkey.pem"}]},
                       {"item",[{"key","env"},{"value","socialbase"}]}]},
                     {"hypervisor","xen"},
                     {"networkInterfaceSet",
                      [{"item",
                        [{"networkInterfaceId","eni-33bdd52b"},
                         {"subnetId","subnet-b90123ff"},
                         {"vpcId","vpc-656bd800"},
                         {"description",[]},
                         {"ownerId","832266673134"},
                         {"status","in-use"},
                         {"macAddress","0a:57:74:cc:cf:4c"},
                         {"privateIpAddress","10.0.3.20"},
                         {"privateDnsName",
                          "ip-10-0-3-20.sa-east-1.compute.internal"},
                         {"sourceDestCheck","true"},
                         {"groupSet",
                          [{"item",
                            [{"groupId","sg-88d9a5ef"},
                             {"groupName","SG_ECS"}]}]},
                         {"attachment",
                          [{"attachmentId","eni-attach-6ef50185"},
                           {"deviceIndex","0"},
                           {"status","attached"},
                           {"attachTime","2017-10-31T09:11:59.000Z"},
                           {"deleteOnTermination","true"}]},
                         {"association",
                          [{"publicIp","54.233.190.88"},
                           {"publicDnsName",
                            "ec2-54-233-190-88.sa-east-1.compute.amazonaws.com"},
                           {"ipOwnerId","amazon"}]},
                         {"privateIpAddressesSet",
                          [{"item",
                            [{"privateIpAddress","10.0.3.20"},
                             {"privateDnsName",
                              "ip-10-0-3-20.sa-east-1.compute.internal"},
                             {"primary","true"},
                             {"association",
                              [{"publicIp","54.233.190.88"},
                               {"publicDnsName",
                                "ec2-54-233-190-88.sa-east-1.compute.amazonaws.com"},
                               {"ipOwnerId","amazon"}]}]}]}]}]},
                     {"iamInstanceProfile",
                      [{"arn",
                        "arn:aws:iam::832266673134:instance-profile/ecsInstanceRole"},
                       {"id","AIPAJOCNFNPLL25WXF3DQ"}]},
                     {"ebsOptimized","false"},
                     {"enaSupport","true"}]}]},
                 {"requesterId","AIDAJKWGGSFI5CGBVYWOY"}]},
               {"item",
                [{"reservationId","r-038d132ce42f84dec"},
                 {"ownerId","832266673134"},
                 {"groupSet",[]},
                 {"instancesSet",
                  [{"item",
                    [{"instanceId","i-03972fdcc1fef7506"},
                     {"imageId","ami-ae0971c2"},
                     {"instanceState",[{"code","16"},{"name","running"}]},
                     {"privateDnsName",
                      "ip-10-0-1-162.sa-east-1.compute.internal"},
                     {"dnsName",
                      "ec2-54-233-153-55.sa-east-1.compute.amazonaws.com"},
                     {"reason",[]},
                     {"keyName","mysshkey"},
                     {"amiLaunchIndex","0"},
                     {"productCodes",[]},
                     {"instanceType","c3.xlarge"},
                     {"launchTime","2017-10-31T09:43:08.000Z"},
                     {"placement",
                      [{"availabilityZone","sa-east-1a"},
                       {"groupName",[]},
                       {"tenancy","default"}]},
                     {"monitoring",[{"state","disabled"}]},
                     {"subnetId","subnet-9018def5"},
                     {"vpcId","vpc-656bd800"},
                     {"privateIpAddress","10.0.1.162"},
                     {"ipAddress","54.233.153.55"},
                     {"sourceDestCheck","true"},
                     {"groupSet",
                      [{"item",
                        [{"groupId","sg-88d9a5ef"},{"groupName","SG_ECS"}]}]},
                     {"architecture","x86_64"},
                     {"rootDeviceType","ebs"},
                     {"rootDeviceName","/dev/xvda"},
                     {"blockDeviceMapping",
                      [{"item",
                        [{"deviceName","/dev/xvda"},
                         {"ebs",
                          [{"volumeId","vol-01a8d67bf15d79b4c"},
                           {"status","attached"},
                           {"attachTime","2017-10-31T09:43:08.000Z"},
                           {"deleteOnTermination","true"}]}]},
                       {"item",
                        [{"deviceName","/dev/xvdcz"},
                         {"ebs",
                          [{"volumeId","vol-0864bd115f76fc03d"},
                           {"status","attached"},
                           {"attachTime","2017-10-31T09:43:08.000Z"},
                           {"deleteOnTermination","true"}]}]}]},
                     {"instanceLifecycle","spot"},
                     {"spotInstanceRequestId","sir-j8sgctnm"},
                     {"virtualizationType","hvm"},
                     {"clientToken","a8fe9bf3-069a-4075-b8ea-bb8e3391aa7e"},
                     {"tagSet",
                      [{"item",[{"key","env"},{"value","socialbase"}]},
                       {"item",[{"key","Name"},{"value","ecs"}]},
                       {"item",
                        [{"key","aws:ec2spot:fleet-request-id"},
                         {"value",
                          "sfr-25ac2bb7-6c67-42b5-b2b0-26683c7f6496"}]},
                       {"item",[{"key","ssh_user"},{"value","ec2-user"}]},
                       {"item",
                        [{"key","weave:peerGroupName"},
                         {"value","socialbase"}]},
                       {"item",
                        [{"key","ssh_key"},{"value","mysshkey.pem"}]},
                       {"item",[{"key","ssh_port"},{"value","22"}]}]},
                     {"hypervisor","xen"},
                     {"networkInterfaceSet",
                      [{"item",
                        [{"networkInterfaceId","eni-0b052c07"},
                         {"subnetId","subnet-9018def5"},
                         {"vpcId","vpc-656bd800"},
                         {"description",[]},
                         {"ownerId","832266673134"},
                         {"status","in-use"},
                         {"macAddress","02:ab:16:6e:be:2e"},
                         {"privateIpAddress","10.0.1.162"},
                         {"privateDnsName",
                          "ip-10-0-1-162.sa-east-1.compute.internal"},
                         {"sourceDestCheck","true"},
                         {"groupSet",
                          [{"item",
                            [{"groupId","sg-88d9a5ef"},
                             {"groupName","SG_ECS"}]}]},
                         {"attachment",
                          [{"attachmentId","eni-attach-e8929901"},
                           {"deviceIndex","0"},
                           {"status","attached"},
                           {"attachTime","2017-10-31T09:43:08.000Z"},
                           {"deleteOnTermination","true"}]},
                         {"association",
                          [{"publicIp","54.233.153.55"},
                           {"publicDnsName",
                            "ec2-54-233-153-55.sa-east-1.compute.amazonaws.com"},
                           {"ipOwnerId","amazon"}]},
                         {"privateIpAddressesSet",
                          [{"item",
                            [{"privateIpAddress","10.0.1.162"},
                             {"privateDnsName",
                              "ip-10-0-1-162.sa-east-1.compute.internal"},
                             {"primary","true"},
                             {"association",
                              [{"publicIp","54.233.153.55"},
                               {"publicDnsName",
                                "ec2-54-233-153-55.sa-east-1.compute.amazonaws.com"},
                               {"ipOwnerId","amazon"}]}]}]}]}]},
                     {"iamInstanceProfile",
                      [{"arn",
                        "arn:aws:iam::832266673134:instance-profile/ecsInstanceRole"},
                       {"id","AIPAJOCNFNPLL25WXF3DQ"}]},
                     {"ebsOptimized","false"},
                     {"enaSupport","true"}]}]},
                 {"requesterId","AIDAJKWGGSFI5CGBVYWOY"}]}]}]}]


=INFO REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: List of registered nodes retrieved from the backend: ['rabbit@ip-10-0-1-162',
                                                                   'rabbit@ip-10-0-3-20']

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: Fetching node details. Unreachable nodes (or nodes that responded with an error): ['rabbit@ip-10-0-1-162',
                                                                                                'rabbit@ip-10-0-3-20']

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: Fetching node details. Responses: []

=ERROR REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: No nodes to choose the preferred from!

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: Picked node as the preferred choice for joining: undefined

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: Running step maybe_cluster

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: We are the first node in the cluster, starting up unconditionally.

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: Starting back 'rabbit' application

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
Memory high watermark set to 409 MiB (429496729 bytes) of 1024 MiB (1073741824 bytes) total

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
Enabling free disk space monitoring

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
Disk free limit set to 50MB

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
Limiting to approx 924 file handles (829 sockets)

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
FHC read buffering:  OFF
FHC write buffering: ON

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
Starting rabbit_node_monitor

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
Management plugin: using rates mode 'basic'

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
started TCP Listener on [::]:5672

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: Running registeration step

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: Running step register_with_backend

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: Running lock release step

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: Running step release_startup_lock

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
Management plugin started. Port: 15672

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
Statistics database started.

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
autocluster: (cleanup) Timer started {60,false}
 completed with 9 plugins.

=INFO REPORT==== 31-Oct-2017::10:21:46 ===
Server startup complete; 9 plugins started.
 * autocluster
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * rabbitmq_management_agent
 * rabbitmq_delayed_message_exchange
 * amqp_client
 * cowboy
 * rabbitmq_aws
 * cowlib

RabbitMQ plugin information via rabbitmq-plugins list

/ # rabbitmq-plugins list
 Configured: E = explicitly enabled; e = implicitly enabled
 | Status:   * = running on rabbit@rabbitmq
 |/
[e*] amqp_client                       3.6.12
[E*] autocluster                       0.9.0+4.g0e7899d
[e*] cowboy                            1.0.4
[e*] cowlib                            1.0.2
[  ] rabbitmq_amqp1_0                  3.6.12
[  ] rabbitmq_auth_backend_ldap        3.6.12
[  ] rabbitmq_auth_mechanism_ssl       3.6.12
[e*] rabbitmq_aws                      3.6.13.milestone1+2.g946e794
[  ] rabbitmq_consistent_hash_exchange 3.6.12
[E*] rabbitmq_delayed_message_exchange 0.0.1
[  ] rabbitmq_event_exchange           3.6.12
[  ] rabbitmq_federation               3.6.12
[  ] rabbitmq_federation_management    3.6.12
[  ] rabbitmq_jms_topic_exchange       3.6.12
[E*] rabbitmq_management               3.6.12
[e*] rabbitmq_management_agent         3.6.12
[  ] rabbitmq_management_visualiser    3.6.12
[  ] rabbitmq_mqtt                     3.6.12
[  ] rabbitmq_recent_history_exchange  3.6.12
[  ] rabbitmq_sharding                 3.6.12
[  ] rabbitmq_shovel                   3.6.12
[  ] rabbitmq_shovel_management        3.6.12
[  ] rabbitmq_stomp                    3.6.12
[  ] rabbitmq_top                      3.6.12
[  ] rabbitmq_tracing                  3.6.12
[  ] rabbitmq_trust_store              3.6.12
[e*] rabbitmq_web_dispatch             3.6.12
[  ] rabbitmq_web_mqtt                 3.6.12
[  ] rabbitmq_web_mqtt_examples        3.6.12
[  ] rabbitmq_web_stomp                3.6.12
[  ] rabbitmq_web_stomp_examples       3.6.12
[  ] sockjs                            0.3.4

Operating system, version, and patch level

Alpine Linux 3.4

rabbitmq-collect-env

https://drive.google.com/file/d/0B7odw6Q-9mFdSzFZWDl5eFBRZ1k/view?usp=sharing

The rabbitmq state has only the current server

[root@k8smaster k8s_statefulsets]# FIRST_POD=$(kubectl get pods --namespace test-rabbitmq -l 'app=rabbitmq' -o jsonpath='{.items[0].metadata.name }')
[root@k8smaster k8s_statefulsets]# kubectl exec --namespace=test-rabbitmq $FIRST_POD rabbitmqctl cluster_status
Cluster status of node '[email protected]'
[{nodes,[{disc,['[email protected]']}]},
{running_nodes,['[email protected]']},
{cluster_name,<<"[email protected]">>},
{partitions,[]},
{alarms,[{'[email protected]',[]}]}]

The RabbitMQ state shows only the current server and does not show the state of the entire cluster. What's the problem? Thank you.

Plugin not compatible to consul 1.0.0

Hi, I've updated Consul and noticed that it was rejecting the service registration with this message:

[ERR] http: Request POST /v1/agent/service/register, error: method POST not allowed from=127.0.0.1:34213

I've checked the Consul changelog (https://github.com/hashicorp/consul/blob/master/CHANGELOG.md) and they now accept only PUT (not POST) for this endpoint.

As per https://github.com/aweber/rabbitmq-autocluster/blob/a8271e8d71b38dd917957aee0f4bd35d055f43f6/src/autocluster_consul.erl#L88 - rabbitmq-autocluster is sending a POST.

Can we change this to PUT to make it compatible?
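
For what it's worth, a minimal sketch of registering a service with a local Consul agent using PUT via OTP's httpc. This is not the plugin's own HTTP helper, and the service name, ID, port and agent address are placeholders:

-module(consul_put_register).
-export([register_service/0]).

%% Minimal sketch, not the plugin's code: register a service with a local
%% Consul agent using PUT, as Consul 1.0.0 requires. Service name, ID, port
%% and agent address are placeholders.
register_service() ->
    {ok, _} = application:ensure_all_started(inets),
    Url  = "http://127.0.0.1:8500/v1/agent/service/register",
    Body = <<"{\"ID\":\"rabbitmq\",\"Name\":\"rabbitmq\",\"Port\":5672}">>,
    case httpc:request(put, {Url, [], "application/json", Body}, [], []) of
        {ok, {{_, 200, _}, _Headers, _RespBody}} -> ok;
        {ok, {{_, Code, _}, _Headers, RespBody}} -> {error, {Code, RespBody}};
        {error, Reason}                          -> {error, Reason}
    end.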

Autocluster attempts to create a session in Consul before session endpoint ready

We're seeing this when bringing up a RabbitMQ cluster on the same nodes that act as Consul servers. When the autocluster plugin attempts to connect to the session endpoint to create a lock (in order to overcome the startup race condition) and the session endpoint is not yet available, Consul returns a 500, and as a result the RabbitMQ server does not start at all.

We've been able to work around this by creating a script which polls the session endpoint in Consul until it's available; in our Puppet manifests we then ensure this script runs before the RabbitMQ server is started.

Ideally, we'd expect the autocluster plugin to poll for the availability of the session endpoint before creating a session/lock, and to retry. Without this, the Rabbit daemon doesn't start properly, so making sure Consul is completely ready before attempting to create the lock/session feels like something the autocluster plugin should be doing.
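
A sketch of the kind of retry the plugin could perform internally, polling GET /v1/session/list until Consul answers before creating the session. The agent address, retry count and interval are illustrative, and this is not the plugin's implementation:

-module(consul_session_wait).
-export([wait_for_session_endpoint/2]).

%% Poll GET /v1/session/list until Consul answers with a 2xx status, sleeping
%% between attempts. Gives up after the given number of attempts. The agent
%% address and retry parameters are illustrative.
wait_for_session_endpoint(0, _IntervalMs) ->
    {error, timeout};
wait_for_session_endpoint(AttemptsLeft, IntervalMs) ->
    {ok, _} = application:ensure_all_started(inets),
    Url = "http://127.0.0.1:8500/v1/session/list",
    case httpc:request(get, {Url, []}, [], []) of
        {ok, {{_, Code, _}, _, _}} when Code >= 200, Code < 300 ->
            ok;
        _NotReadyYet ->
            timer:sleep(IntervalMs),
            wait_for_session_endpoint(AttemptsLeft - 1, IntervalMs)
    end.

For example, wait_for_session_endpoint(30, 2000) would keep trying for roughly a minute before giving up.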

Nodes fail to communicate with peers in AWS Autoscaling Group

I've been trying to get this plugin working for a few days now and cannot seem to get it to create a cluster. Please let me know what I'm doing wrong or if there's a legit bug in the plugin.

I'm running RabbitMQ within a Docker container, hosted on EC2 instances in an AutoScaling Group. There is only one container running on each server.

The attached zip file has the Dockerfile and resources it needs to build.

rabbit-autocluster-docker.zip

My instances use the following User Data script to configure Rabbit as a systemd service (on CentOS 7).

#!/bin/bash
mkdir /root/.docker
chmod 700 /root/.docker
aws configure set default.s3.signature_version s3v4
aws s3 cp s3://my-config-bucket/docker-config-for-private-registry.json /root/.docker/config.json
chmod 600 /root/.docker/config.json

cat >> /etc/systemd/system/rabbit-docker.service <<EOF
[Unit]
Description=RabbitMQ Docker Container
Requires=docker.service
After=docker.service

[Service]
Restart=always
ExecStartPre=/usr/bin/docker volume create --name=rabbit-data
ExecStartPre=/usr/bin/docker pull my.private.registry/myco/rabbit:dev
ExecStart=/usr/bin/docker run --name rabbitmq \
                              --log-driver=awslogs \
                              --log-opt awslogs-region=us-east-1 \
                              --log-opt awslogs-group=/rabbit \
                              --log-opt awslogs-stream=${HOSTNAME} \
                              -p 4369:4369 \
                              -p 5671:5671 \
                              -p 5672:5672 \
                              -p 15672:15672 \
                              -p 25672:25672 \
                              -v rabbit-data:/var/lib/rabbitmq \
                              -e ERLANG_COOKIE=BananaChocolateChip_TryIt_Really_ItsGood \
                              -e RABBITMQ_NODENAME=rabbit@${HOSTNAME}.mydomain.com \
                              -e RABBITMQ_USE_LONGNAME=true \
                              -e AUTOCLUSTER_DELAY=10 \
                              -e AUTOCLUSTER_LOG_LEVEL=debug \
                              -e AUTOCLUSTER_CLEANUP=true \
                              -e CLEANUP_WARN_ONLY=false \
                              -e AWS_AUTOSCALING=true \
                              -e AWS_EC2_TAGS={\"Name\":\"rabbit-autocluster-test\"} \
                              -e AWS_USE_PRIVATE_IP=false \
                              --network host \
                              my.private.registry/myco/rabbit:dev
ExecStop=/usr/bin/docker stop -t 2 rabbitmq
ExecStopPost=/usr/bin/docker rm -f rabbitmq

[Install]
WantedBy=default.target
EOF

systemctl daemon-reload
systemctl start rabbit-docker.service
systemctl enable rabbit-docker.service

The instances are launched in a VPC, with a private subnet connected to a NAT Gateway. The mydomain.com DNS is managed in Route53 and has both forward and reverse lookup entries for all the IP addresses in the subnet (e.g. ip-192-168-205-21.mydomain.com. A 192.168.205.21, 21.205.168.192.in-addr.arpa. PTR ip-192-168-205-21.mydomain.com).

The security group for the nodes allows access to the following ports to all members of the security group:

  • 1883
  • 4369
  • 5671
  • 5672
  • 8883
  • 15672
  • 15674
  • 15675
  • 25672
  • 61613
  • 61614

It also allows traffic on 5672 and 15672 from an ELB (classic) used by clients to connect to the cluster and management ports.

I'm seeing the following error log after the plugin retrieves the Nodes list from AWS:

=INFO REPORT==== 27-Apr-2017::18:24:07 ===
autocluster: Fetching autoscaling = DNS: ["ip-192-168-205-21.ec2.internal"]
=INFO REPORT==== 27-Apr-2017::18:24:07 ===
autocluster: Registering node with aws.
=INFO REPORT==== 27-Apr-2017::18:24:07 ===
autocluster: Registered node with aws.
=INFO REPORT==== 27-Apr-2017::18:24:07 ===
autocluster: Discovered ['[email protected]']
=ERROR REPORT==== 27-Apr-2017::18:24:07 ===
autocluster: Can not communicate with cluster nodes.

This occurs when looking up nodes by autoscaling group, tag only, or a combination of the two.
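
One way to narrow this down (not a definitive fix) is to check from an Erlang shell on one of the nodes, for example via rabbitmqctl eval, whether the discovered peer resolves and answers over Erlang distribution. The hostname and node name below are the ones from the log and would need adjusting:

%% 1. Does the discovered hostname resolve from inside the container?
inet:gethostbyname("ip-192-168-205-21.ec2.internal").

%% 2. Is epmd on the peer reachable (port 4369), and does it list a rabbit node?
net_adm:names("ip-192-168-205-21.ec2.internal").

%% 3. Can this node connect over Erlang distribution (same cookie, same
%%    long-name mode, distribution port 25672 open)?
net_adm:ping('rabbit@ip-192-168-205-21.ec2.internal').

Note also that the backend discovered rabbit@ip-192-168-205-21.ec2.internal while RABBITMQ_NODENAME uses the mydomain.com suffix; if the discovered names and the actual node names do not agree, the ping above will fail even when the network path is fine.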

Please support microsoft service fabric on linux and maybe windows too

Service Fabric supports running on Linux, it can run Docker containers, and it has a service discovery service called the Naming Service, much like Consul. It can even run Docker Compose files and has its own DNS service that can facilitate service discovery based on the state in the Naming Service. Please add it as a supported backend; it would be amazing.

https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-docker-compose
https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-dnsservice
https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-connect-and-communicate-with-services

0.9.0 is incompatible with official rabbitmq Docker image

  • RabbitMQ version: 3.6.12
  • Erlang version: 19.2.1

When running version 0.9.0 in the official rabbitmq:3.6.12-management Docker image, it fails to start with this error:

Failed to enable plugin "autocluster": it may have been built with an incompatible (more recent?) version of Erlang

Would it be possible to release a version of 0.9.0 that is compatible with this Docker image?

Add startup lock support for Consul backend

I actually began working on this one on the back of #6, and it is sort of coming together, hence opening an issue to check if there is any work lined up or done already to avoid duplicated effort. Otherwise happy to submit for an early review in a few days.
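
For context, a rough sketch of the usual Consul lock flow (create a session, then try to acquire a KV key with ?acquire=) using plain httpc. The key name, TTL and the naive JSON handling are placeholders, and this is not the plugin's implementation:

-module(consul_lock_sketch).
-export([acquire_startup_lock/0]).

%% Rough sketch of Consul's session/lock flow: create a session, then try to
%% acquire a KV key with that session. Key name and TTL are placeholders; a
%% real implementation would use a proper JSON library.
acquire_startup_lock() ->
    {ok, _} = application:ensure_all_started(inets),
    Base = "http://127.0.0.1:8500",
    %% 1. Create a session with a TTL.
    {ok, {{_, 200, _}, _, SessionJson}} =
        httpc:request(put, {Base ++ "/v1/session/create", [],
                            "application/json", <<"{\"TTL\":\"30s\"}">>},
                      [], []),
    %% Naive extraction of the session "ID" field.
    {match, [SessionId]} =
        re:run(SessionJson, "\"ID\":\"([^\"]+)\"", [{capture, [1], list}]),
    %% 2. Try to acquire the lock key; Consul answers "true" or "false".
    LockUrl = Base ++ "/v1/kv/rabbitmq/startup_lock?acquire=" ++ SessionId,
    {ok, {{_, 200, _}, _, Acquired}} =
        httpc:request(put, {LockUrl, [], "application/json", <<>>}, [], []),
    case string:trim(Acquired) of
        "true"  -> {ok, SessionId};
        "false" -> {error, lock_not_acquired}
    end.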

AWS instance cannot create cluster with other nodes within the same AWS autoscaling group

Hi all,
I face a problem while trying to cluster two nodes that belong to the same autoscaling group.

I have two AWS instances (CentOS 7) within the same AWS autoscaling group, and each instance has RabbitMQ 3.6.10 with Erlang/OTP 20 installed. I also installed and enabled the rabbitmq-autocluster plugin 0.8.0.

Here's the rabbitmq.config file in both instances:

[
{rabbit, [
{autocluster_log_level, info}
]},
{autocluster, [
{backend, aws},
{aws_autoscaling, true},
{aws_ec2_region, "eu-west-1"},
{aws_access_key, "my_access_key"},
{aws_secret_key, "my_secret_access_key"}
]}
].

I start the first RMQ server in the first instance (rabbit@ip-172-31-20-113). It creates its own single-node cluster as expected.

BUT, when I start the RMQ server in the second instance (rabbit@ip-172-31-16-139) it does not get clustered with the first instance although it recognizes that both of them belong to the same autoscaling group.
Here's the rabbitmq log from the second RMQ server (rabbit@ip-172-31-16-139):

=INFO REPORT==== 28-Sep-2017::08:32:30 ===
autocluster: List of registered nodes retrieved from the backend: ['rabbit@ip-172-31-20-113', 'rabbit@ip-172-31-16-139'] -----> As you can see, the autocluster plugin retrieved the nodes from the scaling group.

=ERROR REPORT==== 28-Sep-2017::08:32:30 ===
autocluster: No nodes to choose the preferred from!

=INFO REPORT==== 28-Sep-2017::08:32:30 ===
autocluster: Picked node as the preferred choice for joining: undefined

=INFO REPORT==== 28-Sep-2017::08:32:30 ===
autocluster: Running step maybe_cluster

=INFO REPORT==== 28-Sep-2017::08:32:30 ===
autocluster: We are the first node in the cluster, starting up unconditionally.

Why doesn't the 2nd instance join the 1st instance's cluster?

I would appreciate any help!
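
For what it's worth, the log suggests the backend returned both node names but neither could be contacted over Erlang distribution, so the list of join candidates ended up empty and the node started on its own. A minimal sketch of that filtering step (not the plugin's actual code) shows the effect:

%% Keep only discovered peers that answer an Erlang distribution ping. An
%% empty result means there is nothing to join, so the node starts standalone.
Discovered = ['rabbit@ip-172-31-20-113', 'rabbit@ip-172-31-16-139'],
Reachable  = [N || N <- Discovered, N =/= node(), net_adm:ping(N) =:= pong],
case Reachable of
    []    -> start_standalone;   %% "We are the first node in the cluster"
    [_|_] -> {join_one_of, Reachable}
end.

If both peers come back as pang, the usual suspects are a mismatched Erlang cookie, closed EPMD (4369) or distribution (25672) ports between the instances, or hostname resolution, rather than the discovery backend itself.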

Cannot get list of discovered service from consul

I have a setup where the Consul agent is running on my host machine, and I am using the pivotalrabbitmq/rabbitmq-autocluster Docker image with autocluster enabled.

My docker-compose file looks like the one below:

version: "2"
services:
   rabbit:
    environment:
      - TCP_PORTS=15672, 5672,25672,4369,8500
      - AUTOCLUSTER_TYPE=consul
      - AUTOCLUSTER_DELAY=60
      - CONSUL_HOST=localhost
      - CONSUL_ACL_TOKEN=b7862315-05fe-4dda-8b13-4f533ec4e205
      - CONSUL_SVC=terracotta_rabbitmq
      - AUTOCLUSTER_CLEANUP=true
      - CLEANUP_WARN_ONLY=false
      - CONSUL_DEREGISTER_AFTER=60
      - AUTOCLUSTER_LOG_LEVEL=debug
      - CONSUL_SVC_ADDR_AUTO=true
      - CONSUL_SVC_ADDR_NODENAME=true
    image: pivotalrabbitmq/rabbitmq-autocluster
    expose:
      - 15672
      - 5672
      - 5671
      - 15671
      - 33431
      - 8300
      - 4369
      - 25672
      - 8500
    tty: true
    network_mode: host
    command:  sh -c "sleep 20; rabbitmq-server;"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

Strangely, at the step where the plugin checks Consul for passing services, it gets an empty list when the query is made from inside the Docker container by the autocluster plugin.

rabbit_1_303964dae0b0 | autocluster: GET http://localhost:8500/v1/health/service/terracotta_rabbitmq
rabbit_1_303964dae0b0 | 
rabbit_1_303964dae0b0 | =INFO REPORT==== 9-Jan-2019::07:35:00 ===
rabbit_1_303964dae0b0 | autocluster: Response: [{ok,{{"HTTP/1.1",200,"OK"},
rabbit_1_303964dae0b0 |                              [{"date","Wed, 09 Jan 2019 07:35:00 GMT"},
rabbit_1_303964dae0b0 |                               {"content-length","2"},
rabbit_1_303964dae0b0 |                               {"content-type","application/json"},
rabbit_1_303964dae0b0 |                               {"x-consul-index","26252192"},
rabbit_1_303964dae0b0 |                               {"x-consul-knownleader","true"},
rabbit_1_303964dae0b0 |                               {"x-consul-lastcontact","0"}],
rabbit_1_303964dae0b0 |                              "[]"}}]

But when I curl from inside the Docker shell, I get the list of services correctly. Because the plugin is not able to identify passing services from Consul, the cluster isn't forming.

RabbitMQ server takes about 3-4 minutes to start with autocluster enabled

I am trying to set up an RMQ cluster with the autocluster plugin. This is what I am observing:

  • With version 0.7.0 : rmq server takes about 3-4 minutes to start with this plugin enabled.
  • With version 0.6.1 : Works perfectly fine!

The only difference I see between 0.6.1 and 0.7.0 is that 0.7.0 requires the autocluster_aws plugin as a dependency. With 0.7.0 (and the 3-4 minute delay), I also see the following error log:

=ERROR REPORT==== 19-May-2017::02:22:26 ===
Failed to retrieve AWS credentials: undefined

This is with the backend configured as etcd; I'm not sure why AWS credentials are being checked at all.

Below is my config (along with RABBITMQ_USE_LONGNAME=true ):

[
  {rabbit, [
    {tcp_listen_options, [
                          {backlog,       128},
                          {nodelay,       true},
                          {linger,        {true,0}},
                          {exit_on_close, false},
                          {sndbuf,        12000},
                          {recbuf,        12000}
                         ]},
    {loopback_users, []}
  ]},

  {autocluster, [
    {dummy_param_without_comma, true},
    {backend, etcd},
    {autocluster_failure, stop},
    {cleanup_interval, 30},
    {cluster_cleanup, true},
    {cleanup_warn_only, false},
    {etcd_ttl, 30},
    {etcd_scheme, http},
    {etcd_host, "etcd.kube-system.svc.cluster.local"},
    {etcd_port, 2379}
   ]}
].

rabbitmq-autocluster does not function with consul 1.0.0 due to API changes

rabbitmq-autocluster does not function with Consul 1.0.0 due to API changes; rolling back to Consul 0.9.3 makes it work as expected.

docker stack deploy -c rabbit-cluster.yml rabbit-cluster

rabbit-cluster.yml

version: '3'

#customize this with options from
#https://www.consul.io/docs/agent/options.html

services:
  consul:
    hostname: consul
    # image: consul:0.9.3 #Works
    image: consul:1.0.0 # Does not Work
    deploy:
      replicas: 1
    environment:
      - "CONSUL_LOCAL_CONFIG={\"disable_update_check\": true}"
      - "CONSUL_BIND_INTERFACE=eth0"
      - "CONSUL_HTTP_ADDR=0.0.0.0"
    entrypoint:
      - consul
      - agent
      - -server
      - -bootstrap-expect=1
      - -data-dir=/tmp/consuldata
      - -bind={{ GetInterfaceIP "eth0" }}
      - -client=0.0.0.0
      - -ui
    networks:
      - "consul"
    ports:
      - "8500:8500"
      - "8600:8600"

  rabbit:
    depends_on:
      - "consul"
    environment:
      - AUTOCLUSTER_TYPE=consul
      - CONSUL_HOST=consul
      - CONSUL_PORT=8500
      - CONSUL_SERVICE_TTL=60
      - AUTOCLUSTER_CLEANUP=true
      - CLEANUP_WARN_ONLY=false
      - CONSUL_SVC_ADDR_AUTO=true
      - CONSUL_DEREGISTER_AFTER=60
    networks:
      - "consul"
    image: rabbitmq-autocluster
    ports:
      - "15672:15672"
    tty: true
    command:  sh -c "sleep 20; rabbitmq-server;"

networks:
  consul:
    driver: overlay

The RabbitMQ log reports:
“autocluster: Step register_with_backend failed, will conitnue nevertheless. Failure reason: Failed to register in backend: 405.”

The Consul log reports:
[ERR] http: Request POST /v1/agent/service/register, error: method POST not allowed from=10.0.1.5:39340

From https://www.consul.io/docs/upgrade-specific.html it appears Consul now requires PUT not POST for /v1/agent/service/register as well as a number of other endpoints.

Handle "notReadyAddresses" in kubernetes

Using StatefulSets, the following API call:

curl -v --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/test-rabbitmq/endpoints/rabbitmq

gets this result:

 "subsets": [
    {
      "notReadyAddresses": [
        {
          "ip": "172.17.0.2",
          "hostname": "rabbitmq-0",
          "nodeName": "minikube",
          "targetRef": {
            "kind": "Pod",
            "namespace": "test-rabbitmq",
            "name": "rabbitmq-0",
            "uid": "3523b6ad-3a17-11e7-9fac-080027cbdcae",
            "resourceVersion": "108775"
          }
        }
      ],

notReadyAddresses means that the pod is starting and is not ready yet.

During startup it causes:

=INFO REPORT==== 16-May-2017::07:23:26 ===
Error description:
   {could_not_start,rabbit,
       {function_clause,
           [{autocluster_k8s,'-extract_node_list/1-lc$^1/1-1-',
                [undefined],
                [{file,"src/autocluster_k8s.erl"},{line,85}]},
            {autocluster_k8s,'-extract_node_list/1-lc$^0/1-0-',1,
                [{file,"src/autocluster_k8s.erl"},{line,85}]},
            {autocluster_k8s,extract_node_list,1,
                [{file,"src/autocluster_k8s.erl"},{line,85}]},
            {autocluster_k8s,nodelist,0,
                [{file,"src/autocluster_k8s.erl"},{line,30}]},
            {autocluster,ensure_registered,3,
                [{file,"src/autocluster.erl"},{line,107}]},
            {autocluster,init,0,[{file,"src/autocluster.erl"},{line,33}]},
            {rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
                [{file,"src/rabbit_boot_steps.erl"},{line,49}]},
            {rabbit_boot_steps,run_step,2,
                [{file,"src/rabbit_boot_steps.erl"},{line,49}]}]}}

Log files (may contain more information):
   tty
   tty

init terminating in do_boot ()

{"init terminating in do_boot",{could_not_start,rabbit,{function_clause,[{autocluster_k8s,'-extract_node_list/1-lc$^1/1-1-',[undefined],[{file,"src/autocluster_k8s.erl"},{line,85}]},{autocluster_k8s,'-extract_node_list/1-lc$^0/1-0-',1,[{file,"src/autocluster_k8s.erl"},{line,85}]},{autocluster_k8s,extract_node_list,1,[{file,"src/autocluster_k8s.erl"},{line,85}]},{autocluster_k8s,nodelist,0,[{file,"src/autocluster_k8s.erl"},{line,30}]},{autocluster,ensure_registered,3,[{file,"src/autocluster.erl"},{line,107}]},{autocluster,init,0,[{file,"src/autocluster.erl"},{line,33}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,"src/rabbit_boot_steps.erl"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,"src/rabbit_boot_steps.erl"},{line,49}]}]}}}

Because proplists:get_value(<<"addresses">>, Subset) is undefined
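
A minimal illustration of why this crashes: a subset that only carries notReadyAddresses has no <<"addresses">> key at all (the IP is the one from the response above), so the lookup yields undefined and extract_node_list/1 fails with function_clause:

%% A subset with only notReadyAddresses has no <<"addresses">> key, so the
%% lookup returns undefined and the comprehension fails with function_clause.
Subset = [{<<"notReadyAddresses">>, [{struct, [{<<"ip">>, <<"172.17.0.2">>}]}]}],
undefined = proplists:get_value(<<"addresses">>, Subset).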

A possible solution is to return an empty list of nodes when the value is undefined:

get_address(Subset) ->
  case proplists:get_value(<<"addresses">>, Subset) of 
      undefined -> autocluster_log:info("No nodes ready yet!"), []; 
      Address -> Address  
  end.

%% @spec extract_node_list(k8s_endpoints()) -> list()
%% @doc Return a list of nodes
%%    see http://kubernetes.io/docs/api-reference/v1/definitions/#_v1_endpoints
%% @end
%%
extract_node_list({struct, Response}) ->
    IpLists = [[proplists:get_value(list_to_binary(autocluster_config:get(k8s_address_type)), Address)
		|| {struct, Address} <- get_address(Subset)]
	       || {struct, Subset} <- proplists:get_value(<<"subsets">>, Response)],
    sets:to_list(sets:union(lists:map(fun sets:from_list/1, IpLists))).


When serviceName is changed, the cluster doesn't work.

I modified the value of "serviceName" in the StatefulSet YAML and "name" in the Service YAML together.
For example, both are set to "rabbitmq2", like this:

metadata:
  name: rabbitmq
  namespace: test-rabbitmq
spec:
  serviceName: rabbitmq2
  replicas: 3
  template:
    metadata:
      labels:
        app: rabbitmq

kind: Service
apiVersion: v1
metadata:
  namespace: test-rabbitmq
  name: rabbitmq2
  labels:
    app: rabbitmq
    type: LoadBalancer

Then the RabbitMQ nodes cannot find each other. Only when the value is "rabbitmq" does the cluster work fine.
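
If I am reading the plugin's Kubernetes settings correctly, the backend queries the endpoints object whose name comes from the k8s_service_name setting (K8S_SERVICE_NAME environment variable), which defaults to "rabbitmq", so renaming the Service should also require pointing the plugin at the new name. A sketch of the config change, assuming that setting name:

[
  {autocluster, [
    {backend, k8s},
    %% Assumed setting name: point the k8s backend at the endpoints of the
    %% renamed Service instead of the default "rabbitmq".
    {k8s_service_name, "rabbitmq2"}
  ]}
].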

Docker image tag

Hey guys, can we please have the docker image tagged with the version of the release on GitHub? We don't want to keep pulling the latest image in case it breaks.

K8S: Unauthorized access while connecting to api server

I'm trying to configure the autocluster plugin with the k8s backend.
I use the recipes from the YAML files in the examples folder to deploy the cluster.
I also use the configuration file from the same git repository.

In my installation I use rabbitmq 3.6.10 and the latest plugin 0.8.0 to deploy a cluster.

While deploying I get the following output in the log files of every node, and the nodes remain independent from each other:

=INFO REPORT==== 4-Aug-2017::18:07:42 ===
autocluster: (cleanup) Timer started {60,false}

=INFO REPORT==== 4-Aug-2017::18:07:42 ===
Starting RabbitMQ 3.6.10 on Erlang 19.3.6.1
Copyright (C) 2007-2017 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 4-Aug-2017::18:07:42 ===
node           : [email protected]
home dir       : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.config
cookie hash    : JJZMksGwEn9ChnNLQbQTdw==
log            : /var/log/rabbitmq/[email protected]
sasl log       : /var/log/rabbitmq/[email protected]
database dir   : /var/lib/rabbitmq/mnesia/[email protected]

=INFO REPORT==== 4-Aug-2017::18:07:43 ===
autocluster: Running discover/join step

=INFO REPORT==== 4-Aug-2017::18:07:43 ===
autocluster: Apps 'rabbit' and 'mnesia' successfully stopped

=INFO REPORT==== 4-Aug-2017::18:07:43 ===
autocluster: Running step initialize_backend

=INFO REPORT==== 4-Aug-2017::18:07:43 ===
autocluster: Using k8s backend

=INFO REPORT==== 4-Aug-2017::18:07:43 ===
autocluster: Running step acquire_startup_lock

=INFO REPORT==== 4-Aug-2017::18:07:43 ===
autocluster: Delaying startup for 7380ms.

=INFO REPORT==== 4-Aug-2017::18:07:50 ===
autocluster: Running step find_best_node_to_join

=INFO REPORT==== 4-Aug-2017::18:07:50 ===
autocluster: GET https://kubernetes.default:443/api/v1/namespaces/rabbitmq/endpoints/rabbitmq

=INFO REPORT==== 4-Aug-2017::18:07:50 ===
autocluster: Response: [{ok,{{"HTTP/1.1",401,"Unauthorized"},
                             [{"date","Fri, 04 Aug 2017 12:07:50 GMT"},
                              {"content-length","13"},
                              {"content-type","text/plain; charset=utf-8"},
                              {"x-content-type-options","nosniff"}],
                             "Unauthorized\n"}}]

=INFO REPORT==== 4-Aug-2017::18:07:50 ===
autocluster: HTTP Response (401) Unauthorized


=INFO REPORT==== 4-Aug-2017::18:07:50 ===
autocluster: Failed to get nodes from k8s - 401

=ERROR REPORT==== 4-Aug-2017::18:07:50 ===
autocluster: Step find_best_node_to_join failed, will conitnue nevertheless. Failure reason: Failed to fetch list of nodes from the backend: "401".

=INFO REPORT==== 4-Aug-2017::18:07:50 ===
autocluster: Starting back 'rabbit' application

=INFO REPORT==== 4-Aug-2017::18:07:50 ===
Memory limit set to 12577MB of 15721MB total.

=INFO REPORT==== 4-Aug-2017::18:07:50 ===
Enabling free disk space monitoring

=INFO REPORT==== 4-Aug-2017::18:07:50 ===
Disk free limit set to 50MB

=INFO REPORT==== 4-Aug-2017::18:07:50 ===
Limiting to approx 1048476 file handles (943626 sockets)

=INFO REPORT==== 4-Aug-2017::18:07:50 ===
FHC read buffering:  OFF
FHC write buffering: OFF

=INFO REPORT==== 4-Aug-2017::18:07:50 ===
Database directory at /var/lib/rabbitmq/mnesia/[email protected] is empty. Initialising from scratch...

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
Starting rabbit_node_monitor

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
Management plugin: using rates mode 'basic'

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 4-Aug-2017::18:07:51 ===
msg_store_persistent: rebuilding indices from scratch

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
Adding vhost '/'

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
Creating user 'guest'

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
Setting user tags for user 'guest' to [administrator]

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
Setting permissions for 'guest' in '/' to '.*', '.*', '.*'

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
started TCP Listener on 0.0.0.0:5672

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
autocluster: Running registeration step

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
autocluster: Running step register_with_backend

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
autocluster: Running lock release step

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
autocluster: Running step release_startup_lock

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
Management plugin started. Port: 15672

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
Statistics database started.

=INFO REPORT==== 4-Aug-2017::18:07:51 ===
Server startup complete; 11 plugins started.
 * rabbitmq_shovel_management
 * rabbitmq_management
 * rabbitmq_management_agent
 * rabbitmq_web_dispatch
 * rabbitmq_sharding
 * rabbitmq_shovel
 * cowboy
 * amqp_client
 * autocluster
 * rabbitmq_aws
 * cowlib

=INFO REPORT==== 4-Aug-2017::18:08:42 ===
autocluster: (cleanup) checking cluster

=INFO REPORT==== 4-Aug-2017::18:08:42 ===
autocluster: (cleanup) Checking for partitioned nodes.

=INFO REPORT==== 4-Aug-2017::18:08:42 ===
autocluster: (cleanup) No partitioned nodes found.

=INFO REPORT==== 4-Aug-2017::18:09:42 ===
autocluster: (cleanup) checking cluster

=INFO REPORT==== 4-Aug-2017::18:09:42 ===
autocluster: (cleanup) Checking for partitioned nodes.

=INFO REPORT==== 4-Aug-2017::18:09:42 ===
autocluster: (cleanup) No partitioned nodes found.

I get a message from the autocluster plugin about unauthorized access:

autocluster: HTTP Response (401) Unauthorized

Also, when I curl this address I get the same message from the Kubernetes API server:

$ curl https://kubernetes.default:443/api/v1/namespaces/rabbitmq/endpoints/rabbitmq --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Unauthorized

I use the following environment variables when deploying the StatefulSet:

          - name: K8S_CERT_PATH
            value: "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
          - name: K8S_TOKEN_PATH
            value: "/var/run/secrets/kubernetes.io/serviceaccount/token"

The ca.crt and token files definitely exist and have valid content!

Is there any way to also provide a client certificate and its private key to authenticate successfully against the Kubernetes API? Or is there perhaps another way to resolve this issue?
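
One thing worth noting: the curl above only passes --cacert and no bearer token, so a 401 from curl is expected regardless of what the plugin does. To reproduce the request the plugin is supposed to make, include the mounted service account token, roughly like this (a sketch using httpc directly; the URL and file paths are the ones already shown above):

%% Reproduce the authenticated request using the mounted service account
%% token and CA certificate. Run from an Erlang shell inside the pod.
{ok, _} = application:ensure_all_started(inets),
{ok, _} = application:ensure_all_started(ssl),
{ok, Token} = file:read_file("/var/run/secrets/kubernetes.io/serviceaccount/token"),
Url = "https://kubernetes.default:443/api/v1/namespaces/rabbitmq/endpoints/rabbitmq",
Headers = [{"Authorization", "Bearer " ++ binary_to_list(Token)}],
SslOpts = [{cacertfile, "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"}],
httpc:request(get, {Url, Headers}, [{ssl, SslOpts}], []).

If this request also returns 401, the service account token itself is not authorised to read the endpoints resource, which would need to be fixed on the Kubernetes side (for example with RBAC) rather than in the plugin.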

Plugin re-activation can fail

When enabling the plugin on the first node in a cluster the following error is generated.

[root@db3 ~]# rabbitmq-plugins enable autocluster
The following plugins have been enabled:
  rabbitmq_aws
  autocluster

Applying plugin configuration to [email protected]... failed.
Error: {{badmatch,false},
        [{autocluster_periodic,start_delayed,3,
                               [{file,"src/autocluster_periodic.erl"},
                                {line,47}]},
         {autocluster_consul,register,0,
                             [{file,"src/autocluster_consul.erl"},{line,135}]},
         {autocluster,register_with_backend,1,
                      [{file,"src/autocluster.erl"},{line,307}]},
         {autocluster,run_steps,1,[{file,"src/autocluster.erl"},{line,131}]},
         {rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
                            [{file,"src/rabbit_boot_steps.erl"},{line,49}]},
         {rabbit_boot_steps,run_step,2,
                            [{file,"src/rabbit_boot_steps.erl"},{line,49}]},
         {rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,
                            [{file,"src/rabbit_boot_steps.erl"},{line,26}]},
         {rabbit_boot_steps,run_boot_steps,1,
                            [{file,"src/rabbit_boot_steps.erl"},{line,26}]}]}

Investigating this seems to point to a race condition in the logic. The plugin runs through the steps initially, acquires a lock and possibly inserts into ets; then, on determining that it is the only node, it goes through the steps again while still holding the initial lock.

Retaining the initial lock itself could be a problem. After the initial lock is released, the process proceeds to register with Consul; after registration, the issue arises when setting up the delayed task.

It seems the initial run may have already created this, so ets:insert_new returns false; its documentation[1] says false is returned if the key already exists.
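
The ets:insert_new/2 behaviour is easy to reproduce in isolation (a standalone illustration, not the plugin's code; the table and key names are made up):

%% The second insert_new/2 for the same key returns false, which is what the
%% badmatch in the traceback above appears to hit.
Tab = ets:new(example_timers, [set, public]),
true  = ets:insert_new(Tab, {cleanup_timer, first_ref}),
false = ets:insert_new(Tab, {cleanup_timer, second_ref}).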

In this state, if you then disable the plugin, it keeps running and trying to recreate the node in consul. I suspect the delayed task is not removed.

A spiral loop is then entered because Consul returns a 500 result code when a request is made to check the state of a service that does not exist.

https://github.com/rabbitmq/rabbitmq-autocluster/blob/stable/src/autocluster_consul.erl#L194
https://github.com/rabbitmq/rabbitmq-autocluster/blob/stable/src/autocluster_consul.erl#L212

The only way to get out of this is to restart the rabbitmq-server process.

I do not have much time at the moment to dig through this, so I have created a workaround[2] to stop the enable error until I have time later. If someone else is able to look into this, it would be great.

[1] http://erlang.org/doc/man/ets.html#insert_new-2
[2] akissa@39e0cb4

Docker image build

Hi, I was following the steps in
https://github.com/rabbitmq/rabbitmq-autocluster/tree/master/examples/k8s_statefulsets
to build the Docker image but I get the following error when starting the service on k8s:

14:36:44.684 [error] Failed to enable plugin "rabbitmq_autocluster": it may have been built with an incompatible (more recent?) version of Erlang

BOOT FAILED
===========

Error description:
    init:start_em/1
    init:start_it/1
    rabbit:start_it/1 line 454
    rabbit:broker_start/0 line 326
    rabbit_plugins:prepare_plugins/1 line 285
    rabbit_plugins:'-prepare_plugins/1-lc$^1/1-1-'/1 line 285
    rabbit_plugins:prepare_dir_plugin/1 line 449
throw:{plugin_built_with_incompatible_erlang,"rabbitmq_autocluster"}
Log file(s) (may contain more information):
   <stdout>

{"init terminating in do_boot",{plugin_built_with_incompatible_erlang,"rabbitmq_autocluster"}}

I built it using Erlang 20.0 and Elixir 1.4.5. I saw afterwards that the dev requirements mention Erlang 17.5, but I wasn't able to find an Elixir version that works with Erlang 17.5, so I'm not sure how you are building it.

Is the Docker image at https://hub.docker.com/r/gsantomaggio/rabbitmq-autocluster/ built using the same steps from the master branch?

0.8.0 Release Timeline

Hello,

I saw this comment indicating that a 0.8.0 release of this plugin was coming soon:

Do you have an ETA for when a binary build of 0.8.0 may be available? We're looking to use functionality introduced in that version. Also, will 3.6.9 be supported in the 0.8.0 release?

Cheers!
