Comments (13)
Hi Amir!
Are you suggesting that a failed reconnect should trigger an update of the slot mapping?
from hiredis-cluster.
If there are many queries, the slot mapping will be updated and the issue will resolve itself after a while, but it's better to update the slot mapping on a failed reconnect too. Thanks for the idea!
We'll accept a PR. Maybe we'll do it ourselves, but I'm not sure when we'll have time.
@zuiderkwast thanks for answering so quickly!
"Failed reconnect" is not exactly the right description: in this case we have other connected nodes, and only some are failing. This does sound like a situation where we can still connect to the configured host and let the cluster node mapping take over.
I will check if we can devote some time to a PR :)
Great. It would also help to have an accurate description of exactly what should trigger updating the slot mapping.
We will analyze it more thoroughly and add more info tomorrow. Here's a breakdown of possible cases:
When can we fail when running a command? (Breaking down redis_cluster_command_execute; feel free to correct me if I'm wrong here.)
1. Bad config (host/port, ssl, etc.)
   - Never connected to any node
   - Bad reply from the cluster nodes or cluster slots command (i.e. wrong addresses)
2. Network issues
   - Slow network reaching timeout
   - Disconnected client
3. Node has gone away (cluster failover to replica)
   - Node connection will close, driver sends the command to an arbitrary node, reply should be MOVED, which will trigger a reconnect.
4. Node has gone away (scale down), driver was connected before to the specific node
   - Like the previous option, with a chance to successfully run commands on the arbitrary node since it might have gotten the relevant shard (will not trigger a reconnect).
5. Node has gone away (scale down), driver was not connected before to the specific node
   - Connection fails, connection returns NULL - goto error.
Our issue stems from the last option. The driver was connected to at least one node, but not to the node that was scaled down. After the scale down it would try to connect to the node that no longer exists, reaching an edge case in the driver where it does not try to re-map the cluster (as far as I can tell from the code). To reproduce this you would need a relatively low-traffic application instance.
@zuiderkwast If you would like to discuss further, we can also schedule a Webex (or other) meeting.
Thanks
Yes, I think your analysis makes sense. As you might know, we have taken over the maintenance so we are not fully aware of why things are the way they are. This is what I think when I read through redis_cluster_command_execute:
    node = node_get_by_table(cc, (uint32_t)command->slot_num);
    if (node == NULL) {
        __redisClusterSetError(cc, REDIS_ERR_OTHER, "node get by table error");
        return NULL;
    }
This means the slot is not covered in the cluster. Maybe we should update the routes in this case too? Perhaps not every time, since it may not help, but only if we haven't done so within some period of time.
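To make the "only if we didn't do it within some period of time" idea concrete, here is a minimal sketch of a time-based throttle. The struct and function names are hypothetical and not part of hiredis-cluster, and the caller is assumed to pass in the current time in seconds.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for route updates; not a hiredis-cluster type. */
typedef struct {
    bool has_updated;        /* false until the first update attempt */
    long long last_update;   /* time of the last attempt, in seconds */
    long long min_interval;  /* minimum seconds between attempts */
} route_update_throttle;

/* Returns true (and records the attempt) if no update has been attempted
 * yet, or if at least min_interval seconds have passed since the last one. */
bool throttle_should_update(route_update_throttle *t, long long now) {
    if (t->has_updated && now - t->last_update < t->min_interval) {
        return false; /* updated recently; another attempt may not help */
    }
    t->has_updated = true;
    t->last_update = now;
    return true;
}
```

The idea would be to call something like this in the node == NULL branch and only refresh the routes when it returns true.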
    c = ctx_get_by_node(cc, node);
    if (c == NULL) {
        return NULL;
    } else if (c->err) {
        node = node_get_which_connected(cc);
As you mentioned, if connect or reconnect fails, ctx_get_by_node returns NULL. Maybe we should fall through to the else and send the command to a random node here? (node_get_which_connected returns an arbitrary node.) If the command is sent to a random node, we'll get a MOVED redirect and then we'll update the routes.
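A rough sketch of that fall-through, using simplified stand-in types rather than the real hiredis-cluster structures. All names here (mock_ctx, mock_node, mock_ctx_get, pick_node) are hypothetical; mock_ctx_get only imitates the "returns NULL when connect fails" behavior described above.

```c
#include <stddef.h>

/* Simplified stand-ins for the driver's context and node types. */
typedef struct { int err; } mock_ctx;
typedef struct { mock_ctx *ctx; int reachable; } mock_node;

/* Imitates the described behavior: NULL when the node cannot be reached. */
mock_ctx *mock_ctx_get(mock_node *n) {
    return (n != NULL && n->reachable) ? n->ctx : NULL;
}

/* Prefer the slot owner. If its context is missing or in error, fall back
 * to the first node with a healthy context. Returns NULL if none is usable. */
mock_node *pick_node(mock_node *owner, mock_node *nodes, size_t count) {
    mock_ctx *c = mock_ctx_get(owner);
    if (c != NULL && c->err == 0) {
        return owner;
    }
    for (size_t i = 0; i < count; i++) {
        mock_ctx *alt = mock_ctx_get(&nodes[i]);
        if (alt != NULL && alt->err == 0) {
            return &nodes[i]; /* this node is expected to answer with MOVED */
        }
    }
    return NULL;
}
```

In this sketch a NULL context and a context with err set are treated the same way, so the caller would not need to care which of the two it got.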
I don't know how c->err can ever be true here. After ctx_get_by_node it seems impossible. Maybe it's a mistake. WDYT?
If we change this, should we change the pipelining functions (redisClusterAppendCommand family) and the async API functions (redisClusterAsyncCommand, actx_get_by_node) too in the same way? I'm not sure it's possible to fall back to a random node in these cases. @bjosv are you familiar with this code?
Some tests for these scenarios would be useful. :-)
I don't know how c->err can ever be true here. After ctx_get_by_node it seems impossible. Maybe it's a mistake. WDYT?
My guess is that if the hiredis function redisReconnect() in ctx_get_by_node() fails, c->err would be true.
I'm not sure of the benefits of reusing the hiredis context via redisReconnect() instead of starting from scratch via redisConnect() here.
If we change this, should we change the pipelining functions (redisClusterAppendCommand family) and the async API functions (redisClusterAsyncCommand, actx_get_by_node) too in the same way? I'm not sure it's possible to fallback to a random node in these cases. @bjosv are you familiar with this code?
It should be possible to fall back to a random node here as well. The async API performs similar actions before sending commands, and it is mostly the responses that are handled differently, via the callback code.
It would be nice to have more common handling of errors, since the async API seems to have its own additional band-aid in its internal callback (which forwards to the user callback or handles retries):
    static void redisClusterAsyncRetryCallback(redisAsyncContext *ac, void *r,
                                               void *privdata) {
        ....
        // Note:
        // I can't decide which is the best way to deal with connect
        // problem for hiredis cluster async api.
        // But now the way is : when enough null reply for a node,
        // we will update the route after the cluster node timeout.
        // If you have a better idea, please contact with me. Thank you.
        // My email: [email protected]
..As you might know, we have taken over the maintenance so we are not fully aware of why things are the way they are. ..
Thank you for your work for the community :) not many projects of this size have people familiar with every part of it, so no worries.
This means the slot is not covered in the cluster. Maybe we should update the routes in this case too?...
You are right, but I'm not sure this can happen if we already have a connected cluster context, unless we somehow got bad inputs (like a slot number over the limit or partially parsed node mapping results).
I don't know how c->err can ever be true here. After ctx_get_by_node it seems impossible. Maybe it's a mistake. WDYT?
I think the redis connection could have an error from a previous call. In a scale down scenario for a connected client, it might be a socket close error after trying to read from it (educated guess).
As you mentioned, if connect or reconnect fails, ctx_get_by_node returns NULL. Maybe we should fall through to the else and send the command to a random node here? (node_get_which_connected returns an arbitrary node.) If the command is sent to a random node, we'll get a MOVED redirect and then we'll update the routes.
It's a good option, but consider a scale down from two nodes to one node in the cluster: the driver will keep going to a "randomly" selected node, and the only one left will hold all the shards in the cluster. We will not get a MOVED reply, so the cluster reroute will not trigger.
This is not really terrible, since in node_get_which_connected we only run the ping command if the connection exists and has no error, so there is at most one ping before we dismiss the node, without any real performance degradation. However, keeping a weird state in the driver might cause other issues. Among them could be losing support for pipelining or other special command flows if they do not have the same fallback scenario.
Also I was going to go into detail with the AWS scale down scenario (DNS/address behavior, etc.), is it still necessary? I feel like the root of the problem has been communicated :)
if the hiredis function redisReconnect() in ctx_get_by_node() fails c->err would be true
@bjosv Right, connect/reconnect is the difference between 5 and 3-4 in @Moo64c's list. It's weird that the function returns NULL in one case and c->err in the other.
I'm not sure of the benefits of reusing the hiredis context via redisReconnect() instead of starting from scratch via redisConnect() here.
redisReconnect (hiredis) looks like a small optimization compared to doing redisFree() followed by redisConnect(). Some allocations are reused, but that's nothing compared to what a reconnect costs. I'm fine with scrapping reconnect. Or we can just make sure we handle reconnect errors the same way as connect errors.
consider a scale down from two nodes to one node in the cluster - the driver will keep going to a "randomly" selected node - the only one left will have all the shards in the cluster. We will not get a MOVED reply and cluster reroute will not trigger.
@Moo64c A Redis cluster must have 3 masters or more, at least when you start it. I don't think it's possible to scale down below 3.
Either way, we don't have to wait for the MOVED. We can update the slot routes when we get a reconnect failure. I'm fine with either solution.
Yeah, so failing to connect to a node that we got from a cluster mapping should trigger a slot routing update (the second solution you suggested). It is a clear indication that something changed.
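Roughly, the behavior could look like the sketch below. It models the scale-down case with a tiny mock cluster and is not the driver's actual code: mock_cluster, mock_connect, mock_refresh_slots and connect_with_refresh are hypothetical names, and a single slot owner stands in for the whole slot table.

```c
#define MOCK_MAX_NODES 3

/* Toy model: which nodes still exist, the owner the client has cached for
 * one slot, and the owner the cluster would report after a refresh. */
typedef struct {
    int alive[MOCK_MAX_NODES];
    int cached_owner;
    int actual_owner;
    int refresh_count;
} mock_cluster;

/* A connect attempt succeeds only if the node still exists. */
int mock_connect(const mock_cluster *cl, int node) {
    return node >= 0 && node < MOCK_MAX_NODES && cl->alive[node];
}

/* Stand-in for re-reading the slot mapping from any reachable node. */
void mock_refresh_slots(mock_cluster *cl) {
    cl->cached_owner = cl->actual_owner;
    cl->refresh_count++;
}

/* Connect to the cached owner. On failure, refresh the mapping once and
 * retry. Returns the node id connected to, or -1 if both attempts fail. */
int connect_with_refresh(mock_cluster *cl) {
    if (mock_connect(cl, cl->cached_owner)) {
        return cl->cached_owner;
    }
    mock_refresh_slots(cl); /* connect failure suggests the topology changed */
    if (mock_connect(cl, cl->cached_owner)) {
        return cl->cached_owner;
    }
    return -1;
}
```

The refresh here does not depend on receiving a MOVED reply, so it would also cover a cluster where the remaining node owns every slot.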
I've also verified with our OPS team - it is possible to go down to a single master (in our case, with one replica). Of course this is a tradeoff in other parameters, but cost is a major one :)
Hi :)
First, thank you for your work.
We are experiencing this issue as well. Do you know when this fix will be added to the driver?
We have #87 to mitigate this problem, and we aim to get the fix in soon (maybe with more tests..).