Comments (10)

michaelklishin avatar michaelklishin commented on September 15, 2024

Thank you for your time.

Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. This assumes two things:

  1. GitHub issues are not used for questions, investigations, root cause analysis, discussions of potential issues, etc (as defined by this team)
  2. We have a certain amount of information to work with

We get at least a dozen questions through various venues every single day, often quite light on details.
At that rate GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because of that, questions, investigations, root cause analysis, and discussions of potential features are all considered to be mailing list material by our team. Please post this to rabbitmq-users.

Getting all the details necessary to reproduce an issue, make a conclusion or even form a hypothesis about what's happening can take a fair amount of time. Our team is multiple orders of magnitude smaller than the RabbitMQ community. Please help others help you by providing a way to reproduce the behavior you're observing, or at least sharing as much relevant information as possible on the list:

  • Server, client library and plugin (if applicable) versions used
  • Server logs
  • A code example or terminal transcript that can be used to reproduce
  • Full exception stack traces (not a single line message)
  • rabbitmqctl status (and, if possible, rabbitmqctl environment output)
  • Other relevant things about the environment and workload, e.g. a traffic capture

Feel free to edit out hostnames and other potentially sensitive information.

When/if we have enough details and evidence we'd be happy to file a new issue.

Thank you.

from rabbitmq-autocluster.

michaelklishin avatar michaelklishin commented on September 15, 2024

Node '[email protected]' thinks it's clustered with node '[email protected]', but '[email protected]' disagrees

appears in this repository's issues as well as many other places. It means one node was reset and the other one wasn't, so A thinks it is not already clustered with B and thus can join it, but B disagrees. Resetting B will help. How exactly you can end up in this situation with various provisioning tools, I cannot know.

srflaxu40 avatar srflaxu40 commented on September 15, 2024

Hey, thanks @michaelklishin. Sorry for not including more details; I am also asking in the #autocluster channel in the RabbitMQ Slack. Even a manual join still fails with the same error. Here is some more debugging I have done:

/ # rabbitmqctl reset
Resetting node '[email protected]'
Error: Mnesia is still running on node '[email protected]'. Please stop the node with rabbitmqctl stop_app first.
/ # rabbitmqctl stop_app
Stopping rabbit application on node '[email protected]'
/ # rabbitmqctl reset
Resetting node '[email protected]'

It appears the solution is that I have to forget the cluster node remotely (not from the same node):

/ # rabbitmqctl join_cluster [email protected]
Clustering node '[email protected]' with '[email protected]'
Error: {inconsistent_cluster,"Node '[email protected]' thinks it's clustered with node '[email protected]', but '[email protected]' disagrees"}
/ # rabbitmqctl join_cluster [email protected]
Clustering node '[email protected]' with '[email protected]'

However, when I run the suggested cluster status:

/ # rabbitmqctl cluster_status
Cluster status of node '[email protected]'
[{nodes,[{disc,['[email protected]','[email protected]']}]},
 {running_nodes,['[email protected]']},
 {cluster_name,<<"rabbit@rabbitmq-statefulset-development-0.rabbitmq.default.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'[email protected]',[]}]}]

I see two disc nodes but only one running, so I start the app on the down node:

~/ops-tools/build-files/rabbitmq$ ./test_status.sh
Cluster status of node '[email protected]'
[{nodes,[{disc,['[email protected]','[email protected]']}]},
 {running_nodes,['[email protected]','[email protected]']},
 {cluster_name,<<"rabbit@rabbitmq-statefulset-development-0.rabbitmq.default.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'[email protected]',[]},{'[email protected]',[]}]}]

Then it works ^^. I thought this would be handled by the plugin with the default settings in my StatefulSet and Service, which I took from this repo.

michaelklishin avatar michaelklishin commented on September 15, 2024

Mnesia is still running on node '[email protected]'. Please stop the node with rabbitmqctl stop_app first

has a hint.
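The order that message asks for can be sketched as a small script. This is only an illustration: the node name `rabbit@node-a` and the `DRY_RUN` switch are assumptions made for the example, not something from this thread. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: reset a node in the order the error message asks for.
# NODE is a hypothetical node name; DRY_RUN=1 (the default) only prints commands.
NODE="${NODE:-rabbit@node-a}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

reset_node() {
  run rabbitmqctl -n "$1" stop_app    # stop the rabbit application, leave the Erlang VM running
  run rabbitmqctl -n "$1" reset       # wipe the node's state, including its cluster membership
  run rabbitmqctl -n "$1" start_app   # bring the application back up as a blank node
}

reset_node "$NODE"
```

Running it with `DRY_RUN=0 NODE=<your node name>` would execute the three commands against a real node instead of printing them.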

michaelklishin avatar michaelklishin commented on September 15, 2024

As the README for this plugin states, it is not a replacement for understanding of the basics of cluster formation. Please follow the clustering 101 transcript on rabbitmq.com and the meaning of the message(s) will be clearer.

srflaxu40 avatar srflaxu40 commented on September 15, 2024

@michaelklishin I understand, but I feel it's a little more than that. The issue is that on boot one broker starts fine, but the second cannot cluster with the first. Both are started using the defaults in the k8s examples.

I had even attempted cleaning up the mnesia stuff as I had found elsewhere:

```
rm -rf /var/lib/rabbitmq/* | true
rm -rf /rabbitmq/var/lib/rabbitmq/* | true

rabbitmq-server -detached
```

srflaxu40 avatar srflaxu40 commented on September 15, 2024

The second broker always starts, fails to join, and crashes with the generic "node disagrees" error.

michaelklishin avatar michaelklishin commented on September 15, 2024

Removing a data directory without first stopping the node won’t get you where you want. There is only one scenario which produces the error message in question.

This is not a support forum. Please post step by step instructions to reproduce to rabbitmq-users or we won’t be able to help you.

michaelklishin avatar michaelklishin commented on September 15, 2024

Alternatively nodes can be reset without restarting with rabbitmqctl reset. It’s a good idea to reset both nodes before trying further.

michaelklishin avatar michaelklishin commented on September 15, 2024

Steps to roughly get into the state @srflaxu40's nodes are in:

  • Start node A
  • Start node B
  • rabbitmqctl -n node-A stop_app
  • rabbitmqctl -n node-A join_cluster node-B
  • Now both nodes are clustered
  • rabbitmqctl -n node-A stop_app (or stop node A any other way without resetting it)
  • Wipe node A's data directory
  • Start node A
  • Try rabbitmqctl -n node-A join_cluster node-B again. Now A thinks it is not a member of a cluster with B, but B thinks A is an existing member since A was never removed from the cluster (or reset, which notifies any running cluster peers)

How do we get out of this state? Reset node B or stop it and wipe its data directory, then restart.
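As a rough sketch of that way out, assuming hypothetical node names `rabbit@node-a` (the wiped node) and `rabbit@node-b` (the node that still remembers A), the sequence could look like the script below. The `DRY_RUN` switch is an assumption added for the example; by default the commands are only printed.

```shell
#!/bin/sh
# Sketch: reset B so it forgets the stale record of A, then join A to B again.
# Node names are hypothetical; DRY_RUN=1 (the default) only prints commands.
NODE_A="${NODE_A:-rabbit@node-a}"   # node whose data directory was wiped
NODE_B="${NODE_B:-rabbit@node-b}"   # node that still lists A as a cluster member
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

recover() {
  a="$1"
  b="$2"
  # 1. Reset B (the app must be stopped first).
  run rabbitmqctl -n "$b" stop_app
  run rabbitmqctl -n "$b" reset
  run rabbitmqctl -n "$b" start_app
  # 2. Join A to B; the app on A must be stopped while joining.
  run rabbitmqctl -n "$a" stop_app
  run rabbitmqctl -n "$a" join_cluster "$b"
  run rabbitmqctl -n "$a" start_app
  # 3. Check the result; both nodes should be listed under running_nodes.
  run rabbitmqctl -n "$a" cluster_status
}

recover "$NODE_A" "$NODE_B"
```

Resetting B discards B's own data as well, so this only fits cases like the one in this thread where neither node holds state worth keeping.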

This really isn't rocket science.
