codership / galera
Synchronous multi-master replication library
License: GNU General Public License v2.0
A freshly installed cluster of 9 nodes, configured into segments as follows:
segment 1: node1, node4, node7
segment 2: node2, node5, node8
segment 3: node3, node6, node9
Nodes were started with the test suite script:
./tests/scripts/command.sh restart
The first view was formed by node1 alone:
2014-05-12 13:21:08 5274 [Note] WSREP: New cluster view: global state:
479307ce-d9d8-11e3-99d1-b6a5ab25498a:0, view# 1: Primary,
number of nodes: 1, my index: 0, protocol version 2
All the other nodes joined in the second view:
2014-05-12 13:21:32 5274 [Note] WSREP: New cluster view: global state:
479307ce-d9d8-11e3-99d1-b6a5ab25498a:0, view# 2: Primary,
number of nodes: 9, my index: 0, protocol version 2
All nodes except node4 and node7 managed to request and receive the state snapshot properly.
Node4 printed in its own log:
2014-05-12 13:21:35 5503 [Note] WSREP: Requesting state transfer failed:
-11(Resource temporarily unavailable). Will keep retrying every 1 second(s)
but after this there was no trace of SST requests from node4 in any log. The SST requesting thread was sitting in:
Thread 3 (Thread 0x7fe4c9285700 (LWP 5621)):
#0 pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fe4c61a36d4 in gcs_replv (conn=0x358fd10, act_in=0x7fe4c92832f0,
act=0x7fe4c9283410, scheduled=false) at gcs/src/gcs.c:1542
#2 0x00007fe4c619e6ff in gcs_repl (conn=0x358fd10, action=0x7fe4c9283410,
scheduled=false) at gcs/src/gcs.h:229
#3 0x00007fe4c61a3d20 in gcs_request_state_transfer (conn=0x358fd10,
req=0x7fe4ac00f1b0, size=50,
donor=0x7fe4c81283f8 <std::string::_Rep::_S_empty_rep_storage+24> "",
ist_uuid=0x7fe4c9283690, ist_seqno=-1, local=0x7fe4c92834f8)
at gcs/src/gcs.c:1643
#4 0x00007fe4c620292b in galera::Gcs::request_state_transfer (this=0x3588b60,
req=0x7fe4ac00f1b0, req_len=50, sst_donor=..., ist_uuid=..., ist_seqno=-1,
seqno_l=0x7fe4c92834f8) at galera/src/gcs.hpp:169
#5 0x00007fe4c62100e9 in galera::ReplicatorSMM::send_state_request (
this=0x3588620, req=0x7fe4ac00f180) at galera/src/replicator_str.cpp:566
#6 0x00007fe4c6210946 in galera::ReplicatorSMM::request_state_transfer (
this=0x3588620, recv_ctx=0x7fe4ac0009a0, group_uuid=..., group_seqno=0,
sst_req=0x7fe4ac00f150, sst_req_len=36)
at galera/src/replicator_str.cpp:657
#7 0x00007fe4c61fff7f in galera::ReplicatorSMM::process_conf_change (
this=0x3588620, recv_ctx=0x7fe4ac0009a0, view_info=..., repl_proto=5,
next_state=galera::Replicator::S_CONNECTED, seqno_l=2)
at galera/src/replicator_smm.cpp:1414
#8 0x00007fe4c61dc263 in galera::GcsActionSource::dispatch (this=0x3588ca8,
recv_ctx=0x7fe4ac0009a0, act=..., exit_loop=@0x7fe4c92843c9: false)
at galera/src/gcs_action_source.cpp:141
#9 0x00007fe4c61dc4d4 in galera::GcsActionSource::process (this=0x3588ca8,
recv_ctx=0x7fe4ac0009a0, exit_loop=@0x7fe4c92843c9: false)
at galera/src/gcs_action_source.cpp:176
#10 0x00007fe4c61f9f94 in galera::ReplicatorSMM::async_recv (this=0x3588620,
recv_ctx=0x7fe4ac0009a0) at galera/src/replicator_smm.cpp:357
#11 0x00007fe4c62150aa in galera_recv (gh=0x351fc10, recv_ctx=0x7fe4ac0009a0)
at galera/src/wsrep_provider.cpp:231
#12 0x00000000006403ac in wsrep_replication_process (thd=0x7fe4ac0009a0)
at /home/teemu/work/bzr/codership-mysql/5.6/sql/wsrep_thd.cc:309
#13 0x000000000061d906 in start_wsrep_THD (
arg=0x6402e5 <wsrep_replication_process(THD*)>)
at /home/teemu/work/bzr/codership-mysql/5.6/sql/mysqld.cc:5349
#14 0x00007fe4c8130f6e in start_thread (arg=0x7fe4c9285700)
at pthread_create.c:311
#15 0x00007fe4c763d9cd in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Node7 printed:
2014-05-12 13:21:35 5504 [Note] WSREP: Requesting state transfer failed:
-11(Resource temporarily unavailable). Will keep retrying every 1 second(s)
...
2014-05-12 13:21:43 5504 [Note] WSREP: Member 1.1 (node7) requested state
transfer from '*any*'. Selected 0.1 (node1)(SYNCED) as donor.
2014-05-12 13:21:43 5504 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 0)
but node1 said:
2014-05-12 13:21:41 5274 [Note] WSREP: Member 1.1 (node7) requested state
transfer from '*any*'. Selected 3.2 (node8)(SYNCED) as donor.
2014-05-12 13:21:42 5274 [ERROR] WSREP:
gcs/src/gcs_group.c:gcs_group_handle_state_request():1311: Member 1.1 (node7)
requested state transfer, but its state is JOINER. Ignoring.
2014-05-12 13:21:43 5274 [ERROR] WSREP:
gcs/src/gcs_group.c:gcs_group_handle_state_request():1311: Member 1.1 (node7)
requested state transfer, but its state is JOINER. Ignoring.
2014-05-12 13:21:44 5274 [Note] WSREP: 3.2 (node8): State transfer to 1.1 (node7)
complete.
and there is no trace in node8's log of donating SST to node7, just:
2014-05-12 13:21:43 5211 [Note] WSREP: Member 1.1 (node7) requested state
transfer from '*any*'. Selected 0.1 (node1)(SYNCED) as donor.
Starting 9 nodes concurrently, one of them timed out waiting for prim at the moment the prim was being re-bootstrapped:
140518 4:28:07 [Note] WSREP: re-bootstrapping prim from partitioned components
140518 4:28:08 [Note] WSREP: evs::proto(bdefe1c5, OPERATIONAL, view_id(REG,ba547f34,7)): state change: OPERATIONAL -> LEAVING
140518 4:28:08 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():141
140518 4:28:08 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -110 (Connection timed out)
140518 4:28:08 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel 'my_wsrep_cluster' at 'gcomm://192.168.17.11:10071?gmcast.listen_addr=tcp://0.0.0.0:10081': -110 (Connection timed out)
140518 4:28:08 [ERROR] WSREP: gcs connect failed: Connection timed out
140518 4:28:08 [ERROR] WSREP: wsrep::connect() failed: 7
140518 4:28:08 [ERROR] Aborting
All the others aborted due to failure to reach consensus:
140518 4:28:45 [Note] WSREP: going to give up, state dump for diagnosis:
evs::proto(evs::proto(be2b058f, GATHER, view_id(REG,ba547f34,7)), GATHER) {
current_view=view(view_id(REG,ba547f34,7) memb {
ba547f34,
bded5f55,
bdee555e,
bdefe1c5,
be0d179f,
be12d899,
be183c60,
be25f051,
be2b058f,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=14,safe_seq=13,node_index=
node: {idx=0,range=[15,14],safe_seq=14}
node: {idx=1,range=[15,14],safe_seq=14}
node: {idx=2,range=[15,14],safe_seq=14}
node: {idx=3,range=[15,14],safe_seq=13}
node: {idx=4,range=[15,14],safe_seq=14}
node: {idx=5,range=[15,14],safe_seq=14}
node: {idx=6,range=[15,14],safe_seq=14}
node: {idx=7,range=[15,14],safe_seq=14}
node: {idx=8,range=[15,14],safe_seq=14} ,
msg_index= (0,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=ba547f34,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=201,nl=(
)
}
(1,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=bded5f55,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=179,nl=(
)
}
(2,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=bdee555e,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=183,nl=(
)
}
(3,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=bdefe1c5,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=157,nl=(
)
}
(4,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=be0d179f,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=199,nl=(
)
}
(5,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=be12d899,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=187,nl=(
)
}
(6,14),{v=0,t=1,ut=5,o=4,s=14,sr=0,as=13,f=4,src=be183c60,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=162,nl=(
)
}
(7,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=be25f051,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=170,nl=(
)
}
(8,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=0,src=be2b058f,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=161,nl=(
)
}
,recovery_index=},
fifo_seq=224,
last_sent=14,
known:
ba547f34 at tcp://192.168.17.11:10011
{o=0,s=1,i=0,fs=231,}
bded5f55 at tcp://192.168.17.11:10071
{o=0,s=1,i=0,fs=209,}
bdee555e at tcp://192.168.17.11:10041
{o=0,s=1,i=0,fs=216,}
bdefe1c5 at tcp://192.168.17.12:10081
{o=0,s=1,i=0,fs=158,lm=
{v=0,t=6,ut=255,o=1,s=14,sr=-1,as=13,f=6,src=bdefe1c5,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=158,nl=(
)
},
}
be0d179f at tcp://192.168.17.13:10061
{o=0,s=1,i=0,fs=229,}
be12d899 at tcp://192.168.17.13:10031
{o=0,s=1,i=0,fs=216,}
be183c60 at tcp://192.168.17.12:10051
{o=0,s=1,i=0,fs=193,}
be25f051 at tcp://192.168.17.12:10021
{o=0,s=1,i=0,fs=198,}
be2b058f at
{o=1,s=0,i=0,fs=-1,jm=
{v=0,t=4,ut=255,o=1,s=13,sr=-1,as=14,f=0,src=be2b058f,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=224,nl=(
ba547f34, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
bded5f55, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
bdee555e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
bdefe1c5, {o=0,s=1,f=0,ls=14,vid=view_id(REG,ba547f34,7),ss=13,ir=[15,14],}
be0d179f, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
be12d899, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
be183c60, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
be25f051, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
be2b058f, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
)
},
}
}
140518 4:28:45 [ERROR] WSREP: exception from gcomm, backend must be restarted:evs::proto(be2b058f, GATHER, view_id(REG,ba547f34,7)) failed to form singleton view after exceeding max_install_timeouts 3, giving up (FATAL)
at gcomm/src/evs_proto.cpp:handle_install_timer():612
It looks like the leaving node failed to acknowledge all the messages it had received, due to the exception, and the remaining nodes failed to reach consensus because of that.
To fix this, the remaining nodes must at some point decide that the leaving node won't be sending any more messages and ignore its safe seq in the consensus computation. A raised suspect flag could be one such indication.
There are a handful of user-unserviceable parameters in galeraparameters.rst. Use the grayed-out entries on the wiki as an indicator of which ones they are. Add some indication to galeraparameters.rst to show the reader that they are meant only for troubleshooting, not for production. Strengthen the language in those entries to further stress this point.
Version: 5.6.15-56 Percona XtraDB Cluster (GPL), Release 25.5, Revision 759, wsrep_25.5.r4061
node1 mysql> set global wsrep_osu_method=rsu;
Query OK, 0 rows affected (0.00 sec)
We switch to the RSU method.
node1 mysql> set global wsrep_desync=off;
Query OK, 0 rows affected (0.00 sec)
2014-06-17 10:14:26 3875 [Note] WSREP: Member 0.0 (node1) resyncs itself to group
2014-06-17 10:14:26 3875 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 33)
2014-06-17 10:14:26 3875 [Note] WSREP: Member 0.0 (node1) synced with group.
2014-06-17 10:14:26 3875 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 33)
2014-06-17 10:14:26 3875 [Note] WSREP: Synchronized with group, ready for connections
Node is primary and synced, ensuring wsrep_desync is off.
node1 mysql> alter table test add index (a);
2014-06-17 10:14:28 3875 [Note] WSREP: Member 0.0 (node1) desyncs itself from group
2014-06-17 10:14:28 3875 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 33)
2014-06-17 10:14:28 3875 [Note] WSREP: Provider paused at 62eb8c72-f601-11e3-b42c-ab6847529d86:33 (54)
2014-06-17 10:14:30 3875 [Note] WSREP: resuming provider at 54
2014-06-17 10:14:30 3875 [Note] WSREP: Provider resumed.
2014-06-17 10:14:30 3875 [Note] WSREP: Member 0.0 (node1) resyncs itself to group
2014-06-17 10:14:30 3875 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 33)
2014-06-17 10:14:30 3875 [Note] WSREP: Member 0.0 (node1) synced with group.
2014-06-17 10:14:30 3875 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 33)
2014-06-17 10:14:30 3875 [Note] WSREP: Synchronized with group, ready for connections
Query OK, 0 rows affected, 1 warning (2.21 sec)
Records: 0 Duplicates: 0 Warnings: 1
Index added (the warning is because it's a duplicate index)
node1 mysql> set global wsrep_desync=on;
2014-06-17 10:14:34 3875 [Note] WSREP: Member 0.0 (node1) desyncs itself from group
2014-06-17 10:14:34 3875 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 33)
Query OK, 0 rows affected (0.02 sec)
Let's desync the node.
node1 mysql> alter table test add index (a);
2014-06-17 10:14:35 3875 [ERROR] WSREP: Node desync failed.: 11 (Resource temporarily unavailable)
at galera/src/replicator_smm.cpp:desync():1623
2014-06-17 10:14:35 3875 [Warning] WSREP: RSU desync failed 3 for alter table test add index (a)
2014-06-17 10:14:35 3875 [Warning] WSREP: ALTER TABLE isolation failure
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
If we then try to do an ALTER TABLE, it fails immediately.
I believe that either it should become possible to do RSU on a desynced node, or a more appropriate error should be shown (we have had customers using such scenarios who got stuck).
This could be useful as part of a migration, where the node needs to be desynced during the process and a schema change is necessary.
It is not particularly important, but it is a bug and triggers asserts in debug builds.
Headings on new pages in documentation require refs to allow for cross-referencing and section links.
For STL etc.
In the script there is a statement which tries to connect to one of the cluster nodes. The statement used is:
nc -z $HOST $PORT >/dev/null && break
This statement fails because the -z option is no longer supported.
I changed this statement into
2>/dev/null >/dev/tcp/${HOST}/${PORT} && break
The new statement works the same way the old nc -z statement did.
The relay set is not updated in gmcast_forget() after removing the ProtoMap entry. There seem to be other places too where a Proto entry is removed from the ProtoMap but the relay set is not updated accordingly. This may leave invalid pointer references in the relay set.
Mostly due to endianness issues.
The 3.x branch has a new wsrep provider option, pc.recovery (boolean). If this parameter is set to true, primary component state is stored to disk and recovered from there when the node is started.
The primary component is recovered automatically once all of the nodes that were present in the last known primary component have established communication with each other.
This feature solves two use cases:
Known limitations:
Version: 5.6.15-56 Percona XtraDB Cluster (GPL), Release 25.5, Revision 759, wsrep_25.5.r4061
Running schema changes in parallel is useful in order to speed them up, since more CPU cores are used. It also lets nodes spend less time in 'maintenance mode'.
This currently fails with RSU.
The same error is given as in #63
On a single node, with 2 sessions, do...
First create 2 tables and put some data in them, so that the ALTER TABLE
takes a few seconds, enough to start another ALTER TABLE
on another table in another session.
session1 mysql> set global wsrep_osu_method=rsu;
Query OK, 0 rows affected (0.00 sec)
RSU
session1 mysql> alter table test add key (a);
Add an index, then immediately run the other ALTER
on the second table in the second session:
session2 mysql> alter table test2 add key (a);
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
You immediately get a deadlock issue.
Log output:
2014-06-17 10:26:04 3875 [Note] WSREP: Member 0.0 (node1) desyncs itself from group
2014-06-17 10:26:04 3875 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 33)
2014-06-17 10:26:04 3875 [Note] WSREP: Provider paused at 62eb8c72-f601-11e3-b42c-ab6847529d86:33 (70)
2014-06-17 10:26:05 3875 [ERROR] WSREP: Node desync failed.: 11 (Resource temporarily unavailable)
at galera/src/replicator_smm.cpp:desync():1623
2014-06-17 10:26:05 3875 [Warning] WSREP: RSU desync failed 3 for alter table test2 add key (a)
2014-06-17 10:26:05 3875 [Warning] WSREP: ALTER TABLE isolation failure
2014-06-17 10:26:06 3875 [Note] WSREP: resuming provider at 70
2014-06-17 10:26:06 3875 [Note] WSREP: Provider resumed.
2014-06-17 10:26:06 3875 [Note] WSREP: Member 0.0 (node1) resyncs itself to group
2014-06-17 10:26:06 3875 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 33)
2014-06-17 10:26:06 3875 [Note] WSREP: Member 0.0 (node1) synced with group.
2014-06-17 10:26:06 3875 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 33)
Issue tracker for new regression tests, regression test fixes and enhancements.
If I have two IPs on one node and start two mysql/galera instances with the two different IPs, they cannot work together because they use the wrong IPs.
For example, one node has 192.168.0.100/101. Instance A uses 192.168.0.100 as its listen address, and B uses 192.168.0.101 as its listen address. When A connects to B, the connection is sourced from 192.168.0.101. B then finds that the connection's remote/peer address is its own listen address and ignores the connection, which causes communication problems between them.
I guess some users also have this problem. https://groups.google.com/forum/#!topic/codership-team/_DCRJIgKY20/discussion
The way to fix this is to bind to the listen address IP when connecting to other nodes.
UPDATE:
related issues:
https://bugs.launchpad.net/galera/+bug/1240964
t801
Nine-node multi-site setup; all nine nodes were started simultaneously. Three of them crashed with a PC layer exception after the full group of 9 was reached. Debug log from one of the crashed nodes:
140515 14:02:01 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '192.168.17.12:10081'
140515 14:02:01 [Note] WSREP: evs::proto(7ccb744c, CLOSED, view_id(TRANS,7ccb744c,0)): state change: CLOSED -> JOINING
140515 14:02:03 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') turning message relay requesting on, nonlive peers: tcp://192.168.17.11:10011 tcp://192.168.17.11:10041 tcp://192.168.17.11:10071 tcp://192.168.17.12:10021 tcp://192.168.17.12:10051 tcp://192.168.17.13:10031 tcp://192.168.17.13:10061
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, JOINING, view_id(TRANS,7ccb744c,0)): detected new message source 7cfbf591
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, JOINING, view_id(TRANS,7ccb744c,0)): shift to GATHER due to foreign message from 7cfbf591
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, JOINING, view_id(TRANS,7ccb744c,0)): state change: JOINING -> GATHER
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)): detected new message source 7da57d09
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)): shift to GATHER due to foreign message from 7da57d09
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)): detected new message source 7dc7b695
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)): shift to GATHER due to foreign message from 7dc7b695
140515 14:02:03 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.12368S), skipping check
140515 14:02:06 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') turning message relay requesting off
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)): setting 7d1499c5 inactive in asymmetry elimination
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)): before asym elimination
1 1 1 1 1 1 1 1 1
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)): after asym elimination
1 1 1 0 1 1 1 1 1
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)): state change: GATHER -> INSTALL
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, INSTALL, view_id(TRANS,7ccb744c,0)): state change: INSTALL -> OPERATIONAL
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, INSTALL, view_id(TRANS,7ccb744c,0)): delivering view view(view_id(TRANS,7ccb744c,0) memb {
7ccb744c,0
} joined {
} left {
} partitioned {
})
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,79dcfbbc,3)): delivering view view(view_id(REG,79dcfbbc,3) memb {
79dcfbbc,0
7ccb744c,0
7cfbf591,0
7d6503d1,0
7d75423e,0
7da57d09,0
7dc7b695,0
7ddadfcf,0
} joined {
79dcfbbc,0
7cfbf591,0
7d6503d1,0
7d75423e,0
7da57d09,0
7dc7b695,0
7ddadfcf,0
} left {
} partitioned {
})
140515 14:02:08 [Note] WSREP: declaring 79dcfbbc at tcp://192.168.17.11:10011 stable
140515 14:02:08 [Note] WSREP: declaring 7cfbf591 at tcp://192.168.17.13:10031 stable
140515 14:02:08 [Note] WSREP: declaring 7d6503d1 at tcp://192.168.17.11:10071 stable
140515 14:02:08 [Note] WSREP: declaring 7d75423e at tcp://192.168.17.11:10041 stable
140515 14:02:08 [Note] WSREP: declaring 7da57d09 at tcp://192.168.17.12:10021 stable
140515 14:02:08 [Note] WSREP: declaring 7dc7b695 at tcp://192.168.17.12:10051 stable
140515 14:02:08 [Note] WSREP: declaring 7ddadfcf at tcp://192.168.17.12:10081 stable
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,79dcfbbc,3)): detected new message source 7d1499c5
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,79dcfbbc,3)): shift to GATHER due to foreign message from 7d1499c5
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,79dcfbbc,3)): state change: OPERATIONAL -> GATHER
140515 14:02:08 [Note] WSREP: Node 79dcfbbc state prim
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): setting 79dcfbbc inactive in asymmetry elimination
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): setting 7d1499c5 inactive in asymmetry elimination
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): setting 7d6503d1 inactive in asymmetry elimination
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): setting 7d75423e inactive in asymmetry elimination
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): setting 7dc7b695 inactive in asymmetry elimination
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): setting 7ddadfcf inactive in asymmetry elimination
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): before asym elimination
1 1 1 1 1 1 1 1 1
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): after asym elimination
0 1 1 0 0 0 1 0 0
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): delayed 79dcfbbc requesting range [1,0]
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): delayed 7d6503d1 requesting range [1,0]
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): delayed 7d75423e requesting range [1,0]
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): delayed 7dc7b695 requesting range [1,0]
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): delayed 7ddadfcf requesting range [1,0]
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): sending install message
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): state change: GATHER -> INSTALL
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, INSTALL, view_id(REG,79dcfbbc,3)): state change: INSTALL -> OPERATIONAL
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, INSTALL, view_id(REG,79dcfbbc,3)): delivering view view(view_id(TRANS,79dcfbbc,3) memb {
7ccb744c,0
7cfbf591,0
7da57d09,0
} joined {
} left {
} partitioned {
79dcfbbc,0
7d6503d1,0
7d75423e,0
7dc7b695,0
7ddadfcf,0
})
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,7ccb744c,4)): delivering view view(view_id(REG,7ccb744c,4) memb {
7ccb744c,0
7cfbf591,0
7da57d09,0
} joined {
} left {
} partitioned {
79dcfbbc,0
7d6503d1,0
7d75423e,0
7dc7b695,0
7ddadfcf,0
})
140515 14:02:13 [Note] WSREP: declaring 7cfbf591 at tcp://192.168.17.13:10031 stable
140515 14:02:13 [Note] WSREP: declaring 7da57d09 at tcp://192.168.17.12:10021 stable
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,7ccb744c,4)): detected new message source 7d1499c5
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,7ccb744c,4)): shift to GATHER due to foreign message from 7d1499c5
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,7ccb744c,4)): state change: OPERATIONAL -> GATHER
140515 14:02:13 [Note] WSREP: Node 7cfbf591 state prim
140515 14:02:13 [Warning] WSREP: 7ccb744c sending install message failed: Resource temporarily unavailable
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)): detected new message source 79dcfbbc
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)): shift to GATHER due to foreign message from 79dcfbbc
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)): detected new message source 7d6503d1
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)): shift to GATHER due to foreign message from 7d6503d1
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)): detected new message source 7d75423e
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)): shift to GATHER due to foreign message from 7d75423e
140515 14:02:14 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') turning message relay requesting on, nonlive peers: tcp://192.168.17.11:10011 tcp://192.168.17.11:10041 tcp://192.168.17.11:10071
140515 14:02:16 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 79dcfbbc (tcp://192.168.17.11:10011), attempt 0
140515 14:02:16 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7d75423e (tcp://192.168.17.11:10041), attempt 0
140515 14:02:16 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7d6503d1 (tcp://192.168.17.11:10071), attempt 0
140515 14:02:16 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7dc7b695 (tcp://192.168.17.12:10051), attempt 0
140515 14:02:16 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7ddadfcf (tcp://192.168.17.12:10081), attempt 0
140515 14:02:17 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 79dcfbbc (tcp://192.168.17.11:10011), attempt 0
140515 14:02:17 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7d75423e (tcp://192.168.17.11:10041), attempt 0
140515 14:02:17 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7d6503d1 (tcp://192.168.17.11:10071), attempt 0
140515 14:02:17 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7ddadfcf (tcp://192.168.17.12:10081), attempt 0
140515 14:02:17 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7dc7b695 (tcp://192.168.17.12:10051), attempt 0
140515 14:02:18 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7d75423e (tcp://192.168.17.11:10041), attempt 0
140515 14:02:18 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7d6503d1 (tcp://192.168.17.11:10071), attempt 0
140515 14:02:18 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7dc7b695 (tcp://192.168.17.12:10051), attempt 0
140515 14:02:18 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7ddadfcf (tcp://192.168.17.12:10081), attempt 0
140515 14:02:18 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 79dcfbbc (tcp://192.168.17.11:10011), attempt 0
140515 14:02:18 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)): state change: GATHER -> INSTALL
140515 14:02:19 [Note] WSREP: evs::proto(7ccb744c, INSTALL, view_id(REG,7ccb744c,4)): state change: INSTALL -> OPERATIONAL
140515 14:02:19 [Note] WSREP: evs::proto(7ccb744c, INSTALL, view_id(REG,7ccb744c,4)): delivering view view(view_id(TRANS,7ccb744c,4) memb {
7ccb744c,0
7cfbf591,0
7da57d09,0
} joined {
} left {
} partitioned {
})
140515 14:02:19 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,79dcfbbc,5)): delivering view view(view_id(REG,79dcfbbc,5) memb {
79dcfbbc,0
7ccb744c,0
7cfbf591,0
7d1499c5,0
7d6503d1,0
7d75423e,0
7da57d09,0
7dc7b695,0
7ddadfcf,0
} joined {
79dcfbbc,0
7d1499c5,0
7d6503d1,0
7d75423e,0
7dc7b695,0
7ddadfcf,0
} left {
} partitioned {
})
140515 14:02:19 [Note] WSREP: declaring 79dcfbbc at tcp://192.168.17.11:10011 stable
140515 14:02:19 [Note] WSREP: declaring 7cfbf591 at tcp://192.168.17.13:10031 stable
140515 14:02:19 [Note] WSREP: declaring 7d1499c5 at tcp://192.168.17.13:10061 stable
140515 14:02:19 [Note] WSREP: declaring 7d6503d1 at tcp://192.168.17.11:10071 stable
140515 14:02:19 [Note] WSREP: declaring 7d75423e at tcp://192.168.17.11:10041 stable
140515 14:02:19 [Note] WSREP: declaring 7da57d09 at tcp://192.168.17.12:10021 stable
140515 14:02:19 [Note] WSREP: declaring 7dc7b695 at tcp://192.168.17.12:10051 stable
140515 14:02:19 [Note] WSREP: declaring 7ddadfcf at tcp://192.168.17.12:10081 stable
140515 14:02:19 [Note] WSREP: Node 79dcfbbc state prim
140515 14:02:20 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
pc::Proto{uuid=7ccb744c,start_prim=0,npvo=0,ignore_sb=0,ignore_quorum=0,state=1,last_sent_seq=0,checksum=0,instances=
79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=-1,weight=1,segment=1
7ccb744c,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=3
7cfbf591,prim=1,un=0,last_seq=1,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=-1,weight=1,segment=3
7d1499c5,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=3
7d6503d1,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=1
7d75423e,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=1
7da57d09,prim=1,un=0,last_seq=0,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=-1,weight=1,segment=2
7dc7b695,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=2
7ddadfcf,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=2
,state_msgs=
79dcfbbc,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7d6503d1,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7d75423e,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7dc7b695,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
7ddadfcf,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
}}
7ccb744c,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=-1,weight=1,segment=1
7ccb744c,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=3
7cfbf591,prim=1,un=0,last_seq=1,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=-1,weight=1,segment=3
7d6503d1,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=1
7d75423e,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=1
7da57d09,prim=1,un=0,last_seq=0,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=-1,weight=1,segment=2
7dc7b695,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=2
7ddadfcf,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=2
}}
7cfbf591,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=5,weight=1,segment=1
7ccb744c,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=3
7cfbf591,prim=1,un=0,last_seq=1,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=5,weight=1,segment=3
7d6503d1,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=1
7d75423e,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=1
7da57d09,prim=1,un=0,last_seq=0,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=5,weight=1,segment=2
7dc7b695,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=2
7ddadfcf,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=2
}}
7d1499c5,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 7d1499c5,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=3
}}
7d6503d1,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7d6503d1,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7d75423e,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7dc7b695,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
7ddadfcf,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
}}
7d75423e,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7d6503d1,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7d75423e,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7dc7b695,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
7ddadfcf,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
}}
7da57d09,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=5,weight=1,segment=1
7ccb744c,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=3
7cfbf591,prim=1,un=0,last_seq=1,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=5,weight=1,segment=3
7d6503d1,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=1
7d75423e,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=1
7da57d09,prim=1,un=0,last_seq=0,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=5,weight=1,segment=2
7dc7b695,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=2
7ddadfcf,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=2
}}
7dc7b695,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7d6503d1,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7d75423e,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7dc7b695,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
7ddadfcf,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
}}
7ddadfcf,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7d6503d1,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7d75423e,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
7dc7b695,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
7ddadfcf,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
}}
,current_view=view(view_id(REG,79dcfbbc,5) memb {
79dcfbbc,0
7ccb744c,0
7cfbf591,0
7d1499c5,0
7d6503d1,0
7d75423e,0
7da57d09,0
7dc7b695,0
7ddadfcf,0
} joined {
79dcfbbc,0
7d1499c5,0
7d6503d1,0
7d75423e,0
7dc7b695,0
7ddadfcf,0
} left {
} partitioned {
}),pc_view=view((empty)),mtu=2147483647}
140515 14:02:20 [Note] WSREP: {v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7ddadfcf,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=142,nl=(
)
} 272
140515 14:02:20 [ERROR] WSREP: exception caused by message: {v=0,t=3,ut=255,o=1,s=0,sr=-1,as=0,f=4,src=7cfbf591,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=112,nl=(
)
}
state after handling message: evs::proto(evs::proto(7ccb744c, OPERATIONAL, view_id(REG,79dcfbbc,5)), OPERATIONAL) {
current_view=view(view_id(REG,79dcfbbc,5) memb {
79dcfbbc,0
7ccb744c,0
7cfbf591,0
7d1499c5,0
7d6503d1,0
7d75423e,0
7da57d09,0
7dc7b695,0
7ddadfcf,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} node: {idx=2,range=[1,0],safe_seq=0} node: {idx=3,range=[1,0],safe_seq=0} node: {idx=4,range=[1,0],safe_seq=0} node: {idx=5,range=[1,0],safe_seq=0} node: {idx=6,range=[1,0],safe_seq=0} node: {idx=7,range=[1,0],safe_seq=0} node: {idx=8,range=[1,0],safe_seq=0} ,msg_index= (8,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7ddadfcf,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=142,nl=(
)
}
,recovery_index= (0,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=79dcfbbc,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=163,nl=(
)
}
(1,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=0,src=7ccb744c,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=109,nl=(
)
}
(2,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7cfbf591,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=110,nl=(
)
}
(3,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7d1499c5,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=134,nl=(
)
}
(4,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7d6503d1,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=151,nl=(
)
}
(5,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7d75423e,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=145,nl=(
)
}
(6,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7da57d09,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=120,nl=(
)
}
(7,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7dc7b695,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=139,nl=(
)
}
},
fifo_seq=111,
last_sent=0,
known:
79dcfbbc at tcp://192.168.17.11:10011
{o=1,s=0,i=1,fs=165,}
7ccb744c at
{o=1,s=0,i=1,fs=-1,}
7cfbf591 at tcp://192.168.17.13:10031
{o=1,s=0,i=1,fs=112,}
7d1499c5 at tcp://192.168.17.13:10061
{o=1,s=0,i=1,fs=136,}
7d6503d1 at tcp://192.168.17.11:10071
{o=1,s=0,i=1,fs=153,}
7d75423e at tcp://192.168.17.11:10041
{o=1,s=0,i=1,fs=147,}
7da57d09 at tcp://192.168.17.12:10021
{o=1,s=0,i=1,fs=122,}
7dc7b695 at tcp://192.168.17.12:10051
{o=1,s=0,i=1,fs=141,}
7ddadfcf at tcp://192.168.17.12:10081
{o=1,s=0,i=1,fs=144,}
}
140515 14:02:20 [ERROR] WSREP: failed to open gcomm backend connection: 131: 7ccb744c last prims not consistent (FATAL)
at gcomm/src/pc_proto.cpp:is_prim():736
at gcomm/src/pc_proto.cpp:handle_msg():1349
at gcomm/src/evs_proto.cpp:handle_gap():3286
at gcomm/src/evs_proto.cpp:handle_msg():2108
140515 14:02:20 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -131 (State not recoverable)
140515 14:02:20 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'my_wsrep_cluster' at 'gcomm://192.168.17.12:10081?gmcast.listen_addr=tcp://0.0.0.0:10091': -131 (State not recoverable)
140515 14:02:20 [ERROR] WSREP: gcs connect failed: State not recoverable
140515 14:02:20 [ERROR] WSREP: wsrep::connect() failed: 7
140515 14:02:20 [ERROR] Aborting
140515 14:02:20 [Note] WSREP: Service disconnected.
140515 14:02:21 [Note] WSREP: Some threads may fail to exit.
140515 14:02:21 [Note] /home/vagrant/galera/local9/mysql/sbin/mysqld: Shutdown complete
The old configuration page carried a section on failure simulation that made three general points about how it could be done. Expand this into a full tutorial on how to break the cluster in controlled ways, to test replication and prepare for actual breakage. Add the new page to the Getting Started Guide, either as a section in testingcluster.rst or as a separate page just after it.
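A tutorial like that could open with a command cheat-sheet for controlled failure injection. The following is an illustrative sketch only; the `DRY_RUN` guard, the process/host targets, and the interface names are assumptions, not part of the existing test scripts:

```shell
#!/bin/sh
# Hypothetical failure-injection helpers for a test cluster (run as root).
# With DRY_RUN=1 the commands are only printed, not executed.
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }

# 1. Simulate a node crash: kill mysqld without any cleanup.
simulate_crash()     { run pkill -9 -f mysqld; }

# 2. Simulate a network partition: drop all traffic from a peer node.
simulate_partition() { run iptables -A INPUT -s "$1" -j DROP; }
heal_partition()     { run iptables -D INPUT -s "$1" -j DROP; }

# 3. Simulate a slow, lossy WAN link between segments.
simulate_wan()       { run tc qdisc add dev "$1" root netem delay 100ms loss 1%; }
```

Each helper can first be exercised with `DRY_RUN=1` to review exactly what would be done to the node.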
Maybe related to the already-closed #38.
2014-05-30 16:05:51 27259 [Note] WSREP: Node 10d5d6f4 state prim
2014-05-30 16:05:51 27259 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
...
2014-05-30 16:05:51 27259 [ERROR] WSREP: exception from gcomm, backend must be restarted: 14c8c47a last prims not consistent (FATAL)
at gcomm/src/pc_proto.cpp:is_prim():787
at gcomm/src/pc_proto.cpp:handle_msg():1402
at gcomm/src/evs_proto.cpp:handle_gap():3299
at gcomm/src/evs_proto.cpp:handle_msg():2127
2014-05-30 16:05:51 27259 [Note] WSREP: Received self-leave message.
2014-05-30 16:05:51 27259 [Note] WSREP: Flow-control interval: [0, 0]
2014-05-30 16:05:51 27259 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2014-05-30 16:05:51 27259 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 0)
2014-05-30 16:05:51 27259 [Note] WSREP: RECV thread exiting 0: Success
2014-05-30 16:05:51 27259 [Note] WSREP: New cluster view: global state: 10d6e99c-e7fe-11e3-9562-9795814c7421:2182875, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version -1
2014-05-30 16:05:51 27259 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-30 16:05:51 27259 [Note] WSREP: Closing send monitor...
2014-05-30 16:05:51 27259 [Note] WSREP: Closed send monitor.
2014-05-30 16:05:51 27259 [Note] WSREP: recv_thread() joined.
2014-05-30 16:05:51 27259 [Note] WSREP: Closing replication queue.
2014-05-30 16:05:51 27259 [Note] WSREP: Closing slave action queue.
2014-05-30 16:05:51 27259 [Note] WSREP: applier thread exiting (code:0)
Three threads were remaining:
Thread 3 (Thread 0x7f1017112700 (LWP 27264)):
#0 pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f1015cf6e8d in gu::Lock::wait (this=<optimized out>, cond=...)
at galerautils/src/gu_lock.hpp:56
#2 0x00007f1015de0962 in galera::ServiceThd::thd_func (arg=0x2c2d260)
at galera/src/galera_service_thd.cpp:30
#3 0x00007f1017cece9a in start_thread (arg=0x7f1017112700)
at pthread_create.c:308
#4 0x00007f10172073fd in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5 0x0000000000000000 in ?? ()
Thread 2 (Thread 0x7f1006ffc700 (LWP 27289)):
#0 pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00000000005d5133 in inline_mysql_cond_wait (src_line=403,
src_file=0xb6e838 "/home/vagrant/codership-mysql/sql/wsrep_thd.cc",
mutex=<optimized out>, that=<optimized out>)
at /home/vagrant/codership-mysql/include/mysql/psi/mysql_thread.h:1162
#2 wsrep_rollback_process (thd=0x7f0ff8000990)
at /home/vagrant/codership-mysql/sql/wsrep_thd.cc:403
#3 0x00000000005bd137 in start_wsrep_THD (arg=0x5d4c70)
at /home/vagrant/codership-mysql/sql/mysqld.cc:5350
#4 0x00007f1017cece9a in start_thread (arg=0x7f1006ffc700)
at pthread_create.c:308
#5 0x00007f10172073fd in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6 0x0000000000000000 in ?? ()
Thread 1 (Thread 0x7f10191aa740 (LWP 27259)):
#0 pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00000000005ce5db in inline_mysql_cond_wait (mutex=<optimized out>,
that=<optimized out>, src_file=<optimized out>, src_line=<optimized out>)
at /home/vagrant/codership-mysql/include/mysql/psi/mysql_thread.h:1162
#2 inline_mysql_cond_wait (src_line=199, mutex=<optimized out>,
that=<optimized out>, src_file=<optimized out>)
at /home/vagrant/codership-mysql/sql/wsrep_sst.cc:193
#3 wsrep_sst_wait () at /home/vagrant/codership-mysql/sql/wsrep_sst.cc:199
#4 0x00000000005c9805 in wsrep_init_startup (first=true)
at /home/vagrant/codership-mysql/sql/wsrep_mysqld.cc:699
#5 0x00000000005c1636 in init_server_components ()
at /home/vagrant/codership-mysql/sql/mysqld.cc:4946
#6 0x00000000005c2235 in mysqld_main (argc=36, argv=0x2baa3d8)
at /home/vagrant/codership-mysql/sql/mysqld.cc:6063
#7 0x00007f101713476d in __libc_start_main (
main=0x5a14b0 <main(int, char**)>, argc=15, ubp_av=0x7fffad9c8cd8,
init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
stack_end=0x7fffad9c8cc8) at libc-start.c:226
#8 0x00000000005b4a5d in _start ()
However, it is a bit unclear whether this is a Galera-side or a MySQL-side issue.
They assume just 'mysqld' when looking through cnf files.
wsrep_flow_control_paused_ns is not documented on http://galeracluster.com/documentation-webpages/galerastatusvariables.html#wsrep-flow-control-paused
PXC 5.6 has a brief mention of it: http://www.percona.com/doc/percona-xtradb-cluster/5.6/wsrep-status-index.html
Some possible bugs, with a patch; they still need to be confirmed.
When combining asynchronous replication with Galera, we need to track the last Xid applied if we want to replicate from another "Master" node.
Currently, the only way to know the last GTID applied is to:
This is very manual. It could be automated, but it could also be enhanced by having the following:
Example of implementation:
http://dev.mysql.com/doc/refman/5.5/en/mysql-cluster-replication-schema.html
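One of the manual steps can at least be scripted today: on a cleanly shut down node, the last known GTID is recorded in grastate.dat. The helper below is a sketch; the `uuid:`/`seqno:` field layout is assumed from typical grastate.dat files and should be checked against the actual installation:

```shell
# Sketch: extract "uuid:seqno" from a grastate.dat-style file.
# Assumes lines of the form "uuid: <uuid>" and "seqno: <n>".
last_gtid_from_grastate() {
    awk -F': *' '$1=="uuid"{u=$2} $1=="seqno"{s=$2} END{print u ":" s}' "$1"
}
```

On a running node the same information is exposed through the wsrep_last_committed status variable (`SHOW STATUS LIKE 'wsrep_last_committed'`).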
Thanks,
Joffrey
wsrep_desync is missing from http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html
On the old documentation wiki, there was a section on the Scriptable SST page that covered calling conventions, which was not carried over to the new docs.
The rest of the Scriptable SST material can be found at the bottom of statetransfer.rst. If the addition makes the page too long, consider breaking it off into a separate scriptablesst.rst file.
At the moment, symptomsolution.rst renders as a table with the particular symptom in the left column and a very detailed solution in the right column. While the logic of this is clear, on a webpage the table formatting makes it difficult to read and to find what you are looking for. Additionally, the table entries do not possess id references, so, in the event of a client having difficulties, it is not possible to link them to a particular symptom/solution pairing on the page.
Rewrite the table material into separate sections. If there are enough entries to justify it, set up a link table like on the parameters pages.
Change the Galera documentation over to the new theme.
By default, the Sphinx build system writes numbering into section headers. Trim this down for a cleaner look.
Adapt the header pages to the new format. Replace links with :doc: elements. Use a hidden toctree to keep the navigation links in line.
Hi,
in ndbcluster, when the MySQL server is restarted, it writes a LOST_EVENTS event to the binary log. On the slave, we need this message in order to fail the replication channel over to another node that was up during the restart, so as not to lose transactions.
Can we add this functionality to Galera? This could be:
Thanks,
Joffrey
Implement a gu::Status class for gathering Galera-wide status variables dynamically. Append dynamic status variables in galera::ReplicatorSMM::stats_get().
This happened during a 3-node stop/cont test; node 1 was stopped for 5 seconds.
Try 4/10
Signaling node local1 with STOP...
140523 02:14:55.512 Job 'signal_cmd' on 'local1' complete in 0 seconds,
Sleeping for 5 sec.
Signaling node local1 with CONT...
140523 02:15:00.528 Job 'signal_cmd' on 'local1' complete in 0 seconds,
Node 2 crashed with exception:
2014-05-23 02:15:00 22990 [ERROR] WSREP: exception from gcomm, backend must be restarted: mn.operational() == false: (FATAL)
at gcomm/src/evs_proto.cpp:deliver_trans():2820
and a bit later node 3 crashed with
2014-05-23 02:15:38 23382 [ERROR] WSREP: exception from gcomm, backend must be restarted: evs::proto(bc98dc51, GATHER, view_id(REG,adc6d070,24)) failed to form singleton view after exceeding max_install_timeouts 3, giving up (FATAL)
at gcomm/src/evs_proto.cpp:handle_install_timer():614
which is probably another incarnation of #37.
Log from node 2 (adc6d070):
2014-05-23 02:15:00 22990 [Note] WSREP: filtering out trans message higher than install message hs 3253: {v=0,t=1,ut=255,o=0,s=3255,sr=0,as=3252,f=6,src=c4d1ce2e,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=43787,nl=(
)
}
2014-05-23 02:15:00 22990 [ERROR] WSREP: exception caused by message: {v=0,t=3,ut=255,o=1,s=-1,sr=-1,as=-1,f=4,src=bc98dc51,srcvid=view_id(REG,adc6d070,25),ru=00000000,r=[-1,-1],fs=47439,nl=(
)
}
state after handling message: evs::proto(evs::proto(adc6d070, INSTALL, view_id(REG,adc6d070,24)), INSTALL) {
current_view=view(view_id(REG,adc6d070,24) memb {
adc6d070,0
bc98dc51,0
c4d1ce2e,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=3255,safe_seq=3252,node_index=node: {idx=0,range=[3257,3256],safe_seq=3255} node: {idx=1,range=[3256,3255],safe_seq=3253} node: {idx=2,range=[3257,3256],safe_seq=3252} },
fifo_seq=66581,
last_sent=3256,
known:
adc6d070 at
{o=1,s=0,i=1,fs=-1,jm=
{v=0,t=4,ut=255,o=1,s=3252,sr=-1,as=3253,f=0,src=adc6d070,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=66575,nl=(
adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}
)
},
}
bc98dc51 at tcp://10.0.2.15:10031
{o=1,s=0,i=1,fs=47439,jm=
{v=0,t=4,ut=255,o=1,s=3252,sr=-1,as=3253,f=4,src=bc98dc51,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=47437,nl=(
adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}
)
},
}
c4d1ce2e at tcp://10.0.2.15:10011
{o=0,s=1,i=0,fs=43777,}
install msg={v=0,t=5,ut=255,o=1,s=3252,sr=-1,as=3253,f=4,src=adc6d070,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=66576,nl=(
adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}
)
}
}
2014-05-23 02:15:00 22990 [ERROR] WSREP: exception from gcomm, backend must be restarted: mn.operational() == false: (FATAL)
at gcomm/src/evs_proto.cpp:deliver_trans():2820
and from node 3 (bc98dc51):
G,adc6d070,24)) suspecting node: adc6d070
2014-05-23 02:15:07 23382 [Note] WSREP: evs::proto(bc98dc51, INSTALL, view_id(REG,adc6d070,24)) suspecting node: c4d1ce2e
2014-05-23 02:15:08 23382 [Warning] WSREP: evs::proto(bc98dc51, INSTALL, view_id(REG,adc6d070,24)) install timer expired
evs::proto(evs::proto(bc98dc51, INSTALL, view_id(REG,adc6d070,24)), INSTALL) {
current_view=view(view_id(REG,adc6d070,24) memb {
adc6d070,0
bc98dc51,0
c4d1ce2e,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=3255,safe_seq=3252,node_index=node: {idx=0,range=[3257,3256],safe_seq=3255} node: {idx=1,range=[3256,3255],safe_seq=3255} node: {idx=2,range=[3257,3256],safe_seq=3252} },
fifo_seq=47455,
last_sent=3255,
known:
adc6d070 at tcp://10.0.2.15:10021
{o=1,s=1,i=0,fs=66580,jm=
{v=0,t=4,ut=255,o=1,s=3252,sr=-1,as=3253,f=0,src=adc6d070,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=66576,nl=(
adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}
)
},
}
bc98dc51 at
{o=1,s=0,i=1,fs=-1,jm=
{v=0,t=4,ut=255,o=1,s=3252,sr=-1,as=3253,f=0,src=bc98dc51,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=47437,nl=(
adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}
)
},
}
c4d1ce2e at tcp://10.0.2.15:10011
{o=0,s=1,i=0,fs=43777,}
install msg={v=0,t=5,ut=255,o=1,s=3252,sr=-1,as=3253,f=4,src=adc6d070,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=66576,nl=(
adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}
)
}
}
2014-05-23 02:15:08 23382 [Note] WSREP: evs::proto(bc98dc51, INSTALL, view_id(REG,adc6d070,24)) node c4d1ce2e failed to commit for install message, declaring inactive
...
2014-05-23 02:15:30 23382 [Note] WSREP: max install timeouts reached, will isolate node for PT20S
2014-05-23 02:15:30 23382 [Note] WSREP: no install message received
2014-05-23 02:15:38 23382 [Warning] WSREP: evs::proto(bc98dc51, GATHER, view_id(REG,adc6d070,24)) install timer expired
...
2014-05-23 02:15:38 23382 [Note] WSREP: going to give up, state dump for diagnosis:
evs::proto(evs::proto(bc98dc51, GATHER, view_id(REG,adc6d070,24)), GATHER) {
current_view=view(view_id(REG,adc6d070,24) memb {
adc6d070,0
bc98dc51,0
c4d1ce2e,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=3255,safe_seq=3252,node_index=node: {idx=0,range=[3257,3256],safe_seq=3255} node: {idx=1,range=[3256,3255],safe_seq=3255} node: {idx=2,range=[3257,3256],safe_seq=3252} },
fifo_seq=47489,
last_sent=3255,
known:
adc6d070 at tcp://10.0.2.15:10021
{o=0,s=1,i=0,fs=66580,}
bc98dc51 at
{o=1,s=0,i=0,fs=-1,jm=
{v=0,t=4,ut=255,o=1,s=3252,sr=-1,as=3255,f=0,src=bc98dc51,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=47489,nl=(
adc6d070, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3255,ir=[3257,3256],}
bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3255,ir=[3256,3255],}
c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3257,3256],}
)
},
}
c4d1ce2e at tcp://10.0.2.15:10011
{o=0,s=1,i=0,fs=43777,}
}
2014-05-23 02:15:38 23382 [ERROR] WSREP: exception from gcomm, backend must be restarted: evs::proto(bc98dc51, GATHER, view_id(REG,adc6d070,24)) failed to form singleton view after exceeding max_install_timeouts 3, giving up (FATAL)
at gcomm/src/evs_proto.cpp:handle_install_timer():614
A possible sequence of events that led to this:
install msg={v=0,t=5,ut=255,o=1,s=3252,sr=-1,as=3253,f=4,src=adc6d070,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=66576,nl=(
adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}
To prevent this from happening, the filtering of incoming messages in the install stage must be tightened. Once an install message has been received, apply stricter checks to incoming messages:
This will work because:
copy from t261
Average message replication latency. Since GCS handles messages asynchronously, it is better to measure this in the GCOMM module.
Originally reported in: https://bugs.launchpad.net/galera/+bug/1271918
The problematic part of the membership protocol is the moment when the install message is received. When many nodes suddenly start to see each other, it is probable that they have a different set of known nodes than the representative had at the time of processing the install message. Therefore, install message handling and install state message processing should be altered somewhat to make group forming more stable.
Related issue is still open because of missing install state improvements: https://bugs.launchpad.net/galera/+bug/1249805
To avoid that, /etc/default/garbd must be listed in a file called "conffiles" in the debian folder.
Update the Administrative Guide for new text and formats.
But scripts/mysql/debian/5.5.list requires the Docs/mysql.info file from the MySQL source code tree:
f 644 root root $DOCS_DST/mysql.info $MYSQL_SRC/Docs/mysql.info
There are at least three types of UUID, and three different files are involved in the UUID process. We had better refactor the UUID code and move some common code to the galerautils module.
There are a handful of files that are either part of an older draft or have had their material integrated elsewhere in the documentation. (See the Sphinx warnings for reference.) Backups are on hand if they need to be referenced later, but they should be removed from the main docs build to avoid stray HTML files turning up on the new site.
This should also help with portability, since 8-byte __sync* builtins are not available on every platform.
See
https://gcc.gnu.org/onlinedocs/gcc-4.8.3/gcc/_005f_005fsync-Builtins.html#_005f_005fsync-Builtins
https://gcc.gnu.org/onlinedocs/gcc-4.8.3/gcc/_005f_005fatomic-Builtins.html#_005f_005fatomic-Builtins
The landing page for the docs and the files in the Overview section require general revisions. The landing page, index.rst, contains a brief line, the toctree, links to index and search, and the legal notice. Overview contains a brief introduction to Galera Cluster and a few files more appropriate to the Reference section.
Move the introductory material from Overview up onto index.rst. Move reference material in Overview and the legal notice on index.rst down into the Reference section.
Maybe related to #10; the node UUID is preserved over restarts.
Currently there are two states stored in two different files:
This is just too ugly, so we need some utility or library that offers the following capabilities:
PS: 'state' here could actually be extended to refer to any data.
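Whichever form the utility takes, the core capability is an atomic, durable state update, so that a crash never leaves a half-written state file behind. A minimal sketch of the usual write-temp-then-rename pattern (function and file names here are illustrative, not existing code):

```shell
# Write the new state to a temp file, flush it, then atomically rename it
# over the old file, so readers only ever see the old or the new state.
save_state_atomic() {
    file="$1"; content="$2"
    tmp="${file}.tmp.$$"
    printf '%s\n' "$content" > "$tmp" || return 1
    sync "$tmp" 2>/dev/null || sync   # fall back where sync(1) takes no args
    mv "$tmp" "$file"
}
```

The rename is atomic on POSIX filesystems, which is what makes the pattern crash-safe.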
In particular this concerns the "commit cut" message. While normally this should not be a problem, imagine a situation where a PC with UUID1:1000 is followed by a PC with UUID2:10. In this case commit cut 100 may cause some gcache buffers to be released too soon.
The solution seems to be to accompany each message with the UUID of the configuration it was sent from. If a message is delivered in the wrong configuration, it can be discarded.
The probability of harm is low; however, this may result in data loss. The fix requires an upgrade of the GCS protocol.
Originally reported here: https://groups.google.com/forum/#!topic/codership-team/O4vHuGBKRC0
http://galeracluster.com/documentation-webpages/galeraparameters.html - 'replication.*' should be 'repl.*'.
missing options on this page:
socket.checksum
repl.key_format
repl.proto_max
repl.max_ws_size
pc.announce_timeout
pc.wait_prim_timeout
gmcast.segment
base_host
base_port
cert.log_conflicts
other suggestions:
4.16.6 selinux - a link to https://blogs.oracle.com/jsmyth/entry/selinux_and_mysql would probably help
5.3.1. Diagnosing Multi-Master Conflicts could include a reference to cert.log_conflicts.
Currently a single inconsistent node can abort all other nodes by committing a transaction that cannot be applied on the slaves. This is because a slave has no way to establish which node is at fault: itself or the master of the transaction.
The task is to implement post-factum error voting before aborting. If a majority of the nodes see the same error for the same transaction, they will consider themselves consistent and force the minority to abort.
Depends on #1
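The voting step itself reduces to a strict-majority count over the nodes' error reports. A sketch of the decision logic, assuming (for illustration only) an input format of one "node error-code" pair per line:

```shell
# Read "node error" pairs on stdin; decide whether the local error code
# is shared by a strict majority of voters.
majority_vote() {
    my_err="$1"
    total=0; same=0
    while read -r node err; do
        [ -z "$node" ] && continue
        total=$((total + 1))
        [ "$err" = "$my_err" ] && same=$((same + 1))
    done
    # Strict majority: more than half of all votes match the local error.
    if [ $((same * 2)) -gt "$total" ]; then
        echo consistent     # majority saw the same error: keep running
    else
        echo inconsistent   # local node is in the minority: abort
    fi
}
```

The tie-breaking and quorum-weighting details of the real design would live on top of this count.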
The environment is a Percona XtraDB Cluster composed of three nodes (db1, db2, and db3). db2 was acting as the primary node when it got overloaded a few times; during the last overload the gcomm background thread stalled for around 20 seconds:
2014-05-26 14:42:49 27548 [Warning] WSREP: last inactive check more than PT1.5S ago (PT21.1142S), skipping check
Because of that, the node dropped from the group while it was requesting SST; soon after, it decided to abort, given that SST was not possible.
While aborting, it tried to leave the cluster gracefully by closing the gcomm connection, which caused some message exchange between the nodes and triggered a bug on db1 and db3, effectively crashing both nodes:
db1:
2014-05-26 14:43:21 2979 [Warning] WSREP: evs::proto(9bdb737e-df4a-11e3-87c9-eab020c42bd0, GATHER, view_id(REG,9bdb737e-df4a-11e3-87c9-eab020c42bd0,174)) install timer expired
2014-05-26 14:43:21 2979 [ERROR] WSREP: exception from gcomm, backend must be restarted: NodeMap::value(i).leave_message() == 0: (FATAL)
db3:
2014-05-26 14:43:21 30104 [Warning] WSREP: evs::proto(c7a2a117-daa4-11e3-8b73-863bb950f40a, GATHER, view_id(REG,9bdb737e-df4a-11e3-87c9-eab020c42bd0,174)) install timer expired
2014-05-26 14:43:21 30104 [ERROR] WSREP: exception from gcomm, backend must be restarted: NodeMap::value(i).leave_message() == 0: (FATAL)
This happened in Percona-XtraDB-Cluster-galera-3-3.5-1.216.rhel6.x86_64, built with Galera 25.3.4
On Fedora (and maybe some other distributions) the /etc/sudoers file contains a line that says:
Defaults requiretty
You must comment out this line before you can start garbd, because /etc/init.d/garb contains a sudo -u nobody statement that fails with Fedora's standard sudoers file.
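A scripted version of that workaround might look like the following sketch; in practice the file should be edited through visudo rather than rewritten directly, and the sed pattern is an illustration:

```shell
# Comment out "Defaults requiretty" in a sudoers-style file,
# writing through a temp file rather than editing in place.
disable_requiretty() {
    sed 's/^\([[:space:]]*Defaults[[:space:]]\{1,\}requiretty\)/#\1/' "$1" > "$1.new" &&
    mv "$1.new" "$1"
}
```

After the change, `sudo -u nobody true` should succeed from the non-tty context the init script runs in.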
The page says:
"Galera Cluster backups can be performed just as regular MySQL backups, using a backup script. Since all the cluster nodes are identical, backing up one node backs up the entire cluster.
However, such backups will have no global transaction IDs associated with them. You can use these backups to recover data, but they cannot be used to recover a Galera Cluster node to a well-defined state. Furthermore, the backup procedure may block the cluster operation for the duration of backup, in the case of a blocking backup."
However, recent versions of Galera Cluster provide the wsrep_desync option. This section should be expanded to cover ways to take backups without the aid of garbd.
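With wsrep_desync, one garbd-free approach is to desync the node only for the duration of the backup, so flow control does not throttle the rest of the cluster. A sketch of such a wrapper (the MYSQL variable is an illustrative hook; override it, e.g. with `MYSQL="echo mysql"`, for a dry run):

```shell
MYSQL="${MYSQL:-mysql}"

# Desync the node, run the given backup command, then resync,
# preserving the backup command's exit status.
backup_with_desync() {
    $MYSQL -e "SET GLOBAL wsrep_desync=ON" || return 1
    "$@"
    rc=$?
    $MYSQL -e "SET GLOBAL wsrep_desync=OFF"
    return $rc
}
```

Usage would be along the lines of `backup_with_desync mysqldump --all-databases > backup.sql`, with the usual caveat that the backup still carries no global transaction ID unless one is recorded separately.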
copy from t848
A typical scenario is a data center outage: after the power supply is back, cluster nodes should be promoted to the primary component automatically.
The designed process is:
Example (happened over WiFi link):
140511 13:38:32 [Warning] WSREP: Quorum: No node with complete state:
Version : 2
Flags : 3
Protocols : 0 / 4 / 2
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : 2ab816e0-d8f8-11e3-a62c-639606db485d
Prim seqno : 37
Last seqno : 0
Prim JOINED : 3
State UUID : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
Group UUID : 2108a884-d873-11e3-a155-d6a734a2dc8b
Name : 'home0'
Incoming addr: '10.21.32.1:3305'
Version : 2
Flags : 2
Protocols : 0 / 4 / 2
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : 2ab816e0-d8f8-11e3-a62c-639606db485d
Prim seqno : 37
Last seqno : 0
Prim JOINED : 3
State UUID : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
Group UUID : 2108a884-d873-11e3-a155-d6a734a2dc8b
Name : 'home2'
Incoming addr: '10.21.32.1:3303'
Version : 2
Flags : 2
Protocols : 0 / 4 / 2
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : 2ab816e0-d8f8-11e3-a62c-639606db485d
Prim seqno : 37
Last seqno : 0
Prim JOINED : 3
State UUID : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
Group UUID : 2108a884-d873-11e3-a155-d6a734a2dc8b
Name : 'home1'
Incoming addr: '10.21.32.1:3304'
Version : 2
Flags : 2
Protocols : 0 / 4 / 2
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : 351c8568-d8f6-11e3-a7dc-027f4ff287d7
Prim seqno : 35
Last seqno : 0
Prim JOINED : 6
State UUID : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
Group UUID : 2108a884-d873-11e3-a155-d6a734a2dc8b
Name : 'centos6-32'
Incoming addr: '192.168.122.119:3306'
Version : 2
Flags : 2
Protocols : 0 / 4 / 2
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : 351c8568-d8f6-11e3-a7dc-027f4ff287d7
Prim seqno : 35
Last seqno : 0
Prim JOINED : 6
State UUID : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
Group UUID : 2108a884-d873-11e3-a155-d6a734a2dc8b
Name : 'centos5-64'
Incoming addr: '192.168.122.195:3306'
Version : 2
Flags : 0
Protocols : 0 / 4 / 2
State : NON-PRIMARY
Prim state : PRIMARY
Prim UUID : 2ab816e0-d8f8-11e3-a62c-639606db485d
Prim seqno : 37
Last seqno : 0
Prim JOINED : 3
State UUID : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
Group UUID : 2108a884-d873-11e3-a155-d6a734a2dc8b
Name : 'centos6-64'
Incoming addr: '192.168.122.192:3306'
Version : 2
Flags : 2
Protocols : 0 / 4 / 2
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : 351c8568-d8f6-11e3-a7dc-027f4ff287d7
Prim seqno : 35
Last seqno : 0
Prim JOINED : 6
State UUID : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
Group UUID : 2108a884-d873-11e3-a155-d6a734a2dc8b
Name : 'centos5-32'
Incoming addr: '192.168.122.135:3306'
mysqld: gcs/src/gcs_state_msg.c:530: state_quorum_remerge: Assertion `states[i]->prim_joined == candidates[j].prim_joined' failed.
10:38:32 UTC - mysqld got signal 6 ;
There is a merge of two primary components, 35 and 37, that share the same uuid:seqno. The problem is that the component ID is not taken into consideration when the states are merged.
In this particular case the merge would have succeeded with debug assertions disabled, but it is easy to imagine a situation where it won't (e.g. 3 nodes from component 37 plus 2 nodes from component 35 gives 5 nodes, while 35 had 6 members, so no full remerge is possible).
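The arithmetic above can be sketched as follows. This is a hypothetical illustration, not the actual state_quorum_remerge() code: `node_state`, `component_has_quorum`, and the majority rule are assumptions made for the example. The point is that state messages must be grouped by the primary component they report, so that `prim_joined` is consistent within each group and the assertion cannot fire.

```c
/* Each state message carries the last primary component the node
 * belonged to.  Grouping only by group uuid:seqno conflates
 * components 35 and 37; grouping by component keeps prim_joined
 * consistent within each candidate group. */
typedef struct {
    long prim_seqno;   /* id of the last primary component seen    */
    int  prim_joined;  /* how many nodes that component had JOINED */
} node_state;

/* Count the nodes reporting a given primary component and test
 * whether they form a strict majority of that component. */
static int component_has_quorum(const node_state *states, int n,
                                long prim_seqno)
{
    int joined = 0, present = 0;
    for (int i = 0; i < n; i++) {
        if (states[i].prim_seqno == prim_seqno) {
            present++;
            joined = states[i].prim_joined; /* same component => same count */
        }
    }
    return present > joined / 2;
}
```

With the numbers from the report: component 37 had 3 members and all 3 are present, so it has quorum; component 35 had 6 members but only 2 are present, so it does not, and no full remerge happens.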
An exception thrown from gcomm caused the gcs gcomm backend to generate a self-leave message. After processing the self-leave, only one of the gcs threads was able to exit cleanly:
140518 4:28:46 [ERROR] WSREP: exception from gcomm, backend must be restarted:evs::proto(ba547f34, GATHER, view_id(REG,ba547f34,7)) failed to form singleton view after exceeding max_install_timeouts 3, giving up (FATAL)
at gcomm/src/evs_proto.cpp:handle_install_timer():612
140518 4:28:46 [Note] WSREP: Received self-leave message.
140518 4:28:46 [Note] WSREP: Flow-control interval: [0, 0]
140518 4:28:46 [Note] WSREP: Received SELF-LEAVE. Closing connection.
140518 4:28:46 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 0)
140518 4:28:46 [Note] WSREP: RECV thread exiting 0: Success
140518 4:28:46 [Note] WSREP: New cluster view: global state: 1e57ce94-ddb7-11e3-96a4-06023ac83134:0, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
140518 4:28:46 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
140518 4:28:46 [Note] WSREP: applier thread exiting (code:0)
All other receiver threads are stuck inside gcs waiting for messages:
Thread 2 (Thread 0x7f5be41dc700 (LWP 18645)):
#0 pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007f5be6a54cae in fifo_lock_get (q=q@entry=0x2cb8520)
at galerautils/src/gu_fifo.c:233
#2 0x00007f5be6a552a8 in gu_fifo_get_head (q=0x2cb8520,
err=err@entry=0x7f5be41db28c) at galerautils/src/gu_fifo.c:292
#3 0x00007f5be6b0d85c in gcs_recv (conn=0x2c8ffe0, action=0x7f5be41db2e0)
at gcs/src/gcs.c:1671
#4 0x00007f5be6b39deb in galera::Gcs::recv (this=<optimized out>, act=...)
at galera/src/gcs.hpp:103
#5 0x00007f5be6b22f98 in galera::GcsActionSource::process (this=0x2c89450,
recv_ctx=0x7f5bb4000a00, exit_loop=@0x7f5be41db35d: false)
at galera/src/gcs_action_source.cpp:164
#6 0x00007f5be6b37deb in galera::ReplicatorSMM::async_recv (this=0x2c88e90,
recv_ctx=0x7f5bb4000a00) at galera/src/replicator_smm.cpp:390
#7 0x00007f5be6b4351d in galera_recv (gh=<optimized out>,
recv_ctx=<optimized out>) at galera/src/wsrep_provider.cpp:213
#8 0x000000000067eec3 in wsrep_replication_process (thd=0x7f5bb4000a00)
at /home/teemu/work/bzr/codership-mysql/5.5/sql/wsrep_thd.cc:253
#9 0x000000000050d451 in start_wsrep_THD (
arg=0x67ee4e <wsrep_replication_process(THD*)>)
at /home/teemu/work/bzr/codership-mysql/5.5/sql/mysqld.cc:4482
#10 0x00007f5be9066f6e in start_thread (arg=0x7f5be41dc700)
at pthread_create.c:311
#11 0x00007f5be814f9cd in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Both gcs conn and conn->core show closed state:
(gdb) p *conn
$1 = {my_idx = -1, memb_num = 0, my_name = 0x0, channel = 0x0, socket = 0x0,
state = GCS_CONN_CLOSED, config = 0x2c88ea0, config_is_local = false,
params = {fc_resume_factor = 1, recv_q_soft_limit = 0.25,
max_throttle = 0.25, recv_q_hard_limit = 8301034833169298432,
fc_base_limit = 16, max_packet_size = 64500, fc_debug = 0,
fc_master_slave = false, sync_donor = false}, gcache = 0x2c890c0,
sm = 0x7f5be9bc3010, local_act_id = 7, global_seqno = 0, repl_q = 0x2c903a0,
send_thread = 0, recv_q = 0x2cb8520, recv_q_size = 0,
recv_thread = 140032442623744, timeout = 9223372035999999999, fc_lock = {
__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0,
__spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 39 times>, __align = 0}, conf_id = 4294967295,
stop_sent = 0, stop_count = 0, queue_len = 0, upper_limit = 0,
lower_limit = 0, fc_offset = 0, max_fc_state = GCS_CONN_JOINED,
stats_fc_sent = 0, stats_fc_received = 0, stfc = {
hard_limit = 8301034833169298432, soft_limit = 2075258708292324608,
max_throttle = 0.25, init_size = 0, size = 0, last_sleep = 0,
act_count = 0, max_rate = 0, scale = 0, offset = 0, start = 0, debug = 0,
sleep_count = 0, sleeps = 0}, need_to_join = false, join_seqno = 0,
sync_sent = false, core = 0x2c901c0}
(gdb) p *conn->core
$2 = {config = 0x2c88ea0, cache = 0x2c890c0, prim_comp_no = 0,
state = CORE_CLOSED, proto_ver = 0, send_lock = {__data = {__lock = 0,
__count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0,
__elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 39 times>, __align = 0}, send_buf = 0x2ca2500,
send_buf_len = 32636, send_act_no = 1, recv_msg = {buf = 0x2c924f0,
buf_len = 65536, size = 64, sender_idx = -1, type = GCS_MSG_COMPONENT},
fifo = 0x2c8fac0, group = {cache = 0x2c890c0, act_id = 0, conf_id = -1,
state_uuid = {data = "\320\ri\234\336D\021\343\222\vs\301\276q<E"},
group_uuid = {data = "\036WΔݷ\021㖤\006\002:\310\061\064"}, num = 0,
my_idx = -1, my_name = 0x2c8fbe0 "node1",
my_address = 0x2c8d910 "192.168.16.11:3311",
state = GCS_GROUP_NON_PRIMARY, last_applied = 0, last_node = 0,
frag_reset = true, nodes = 0x0, prim_uuid = {
data = "\272U\370s\336D\021㈀\367\316ˋ\245a"}, prim_seqno = 1,
prim_num = 1, prim_state = GCS_NODE_STATE_SYNCED,
gcs_proto_ver = 0 '\000', repl_proto_ver = 4, appl_proto_ver = 2,
quorum = {group_uuid = {data = "\036WΔݷ\021㖤\006\002:\310\061\064"},
act_id = 0, conf_id = 0, primary = true, version = 2, gcs_proto_ver = 0,
repl_proto_ver = 4, appl_proto_ver = 2}, last_applied_proto_ver = 1},
msg_size = 0, backend = {conn = 0x2cc2000, open = 0x7f5be6b0f979
<gcomm_open(gcs_backend_t*, char const*, bool)>,
close = 0x7f5be6b0fd68 <gcomm_close(gcs_backend_t*)>,
destroy = 0x7f5be6b0e968 <gcomm_destroy(gcs_backend_t*)>,
send = 0x7f5be6b10657 <gcomm_send(gcs_backend_t*, void const*, size_t, gcs_msg_type_t)>,
recv = 0x7f5be6b0f4d9 <gcomm_recv(gcs_backend_t*, gcs_recv_msg_t*, long long)>, name = 0x7f5be6b0e8b0 <gcomm_name()>,
msg_size = 0x7f5be6b0eadc <gcomm_msg_size(gcs_backend_t*, long)>,
param_set = 0x7f5be6b0ebbb <gcomm_param_set(gcs_backend_t*, char const*, char const*)>,
param_get = 0x7f5be6b0e8b8 <gcomm_param_get(gcs_backend_t*, char const*)>}}