Comments (29)
This is legacy behavior and a bit strange, and it would be nice to have it changed.
A disconnect is not always an indication of a change in the slot distribution, but if the slot distribution hasn't been updated for a while, maybe hiredis-cluster should perform a check/update in this case instead.
from hiredis-cluster.
Isn't connect() supposed to be a blocking call? I hope you are talking about redisAsyncConnectWithOptions(), which then calls redisContextConnectBindTcp(), where I could see that after the connect() call it checks the status of connect().
I would guess the status check is skipped for async.
redisAsyncConnectWithOptions() sets REDIS_OPT_NONBLOCK, which means REDIS_BLOCK is not set on the hiredis context.
It could be that the connect fails with errno=EINPROGRESS, which is not uncommon.
From hiredis net.c:

```c
static int _redisContextConnectTcp(redisContext *c, const char *addr, int port,
...
    int blocking = (c->flags & REDIS_BLOCK);
...
    if (connect(s,p->ai_addr,p->ai_addrlen) == -1) {
        if (errno == EHOSTUNREACH) {
            redisNetClose(c);
            continue;
        } else if (errno == EINPROGRESS) {
            if (blocking) {
                goto wait_for_ready; /* <--- skipped in the non-blocking case */
            }
```
I think this can be observed in the hiredis-cluster async tests; I will double-check.
Just as a note, there is this in redisAsyncInitialize(), called just after redisConnectWithOptions() in hiredis' redisAsyncConnectWithOptions():
https://github.com/redis/hiredis/blob/6f5bae8c6900e051da6e677756508707565ce56e/async.c#L124-L127
Exactly what I think: __redisAsyncHandleConnect() should return REDIS_OK with completed=1; that is the only way to set the REDIS_CONNECTED flag.
So, somehow it managed that.
Could it be that the K8s pod still exists with the Redis instance, but it's not included in the Redis cluster?
Update: OK, you had a printout of kubectl get pods without it.. yes, strange.
Yeah, this flapping went on for at least 2 hours, so it's very unlikely that could be the case.
Just thinking whether there is any chance connect() can return 0 in a non-success case.
Just one other thing that I thought of:
You ran the cluster nodes command manually to check the Redis cluster, i.e. the log above with >>> Calling cluster nodes. Did you run this command on each node to check each node's view of the cluster?
This would just make sure they are all in sync, since hiredis-cluster can get its slot distribution from any master node, which might not be the one you send the command to.
Probably unlikely to be the problem, though.. but it depends on how your K8s Redis operator removes a node (whether it sends CLUSTER FORGET to all other nodes).
Hi @bjosv
The cluster nodes command is run as below on the redis-cluster Kubernetes service FQDN:
redis-cli --cluster call redis-cluster.local:6379 cluster nodes
It is not run for all nodes separately. OK, I see you think the operator could also be involved in this. We are using the IBM redis-operator for this.
Hi, can someone please take a look at this one? We are a bit blocked by this currently.
That sounds strange.
I guess you have checked the individual nodes by running CLUSTER NODES (or CLUSTER SLOTS) on each of them, to make sure they are all in sync.
Do you set the max retry config to a specific value (using redisClusterSetOptionMaxRetry()) or is it the default? (In the async API that means 5 errors start a timer to fetch the topology.)
Do you have/see any logs about failed command callbacks, which would then trigger a new topology fetch?
Hi @bjosv
Yes, the cluster nodes command was separately run on the redis-server when the issue happened, and the output (captured below) matches the kubectl output.
```
>>> Calling cluster nodes
redis-cluster.local:6379: 51c49c0ba2f1c33456dafc1654e4969c8ab25416 192.168.100.3:6379@16379 slave 6d68da34389cb97e70e4ee8075108f6394847446 0 1673519200113 29 connected
6fde1a7b2f6ed7355066534e8a1656f0f9a0e988 192.168.164.92:6379@16379 slave 8035f1c9c0a7ade4a303381e492ff2b85dbe5f96 0 1673519198598 35 connected
8035f1c9c0a7ade4a303381e492ff2b85dbe5f96 192.168.228.4:6379@16379 master - 0 1673519198000 35 connected 10924-16383
7061695cb989db63cf08cecf5dcdcee7293a7c4e 192.168.33.99:6379@16379 myself,master - 0 1673519195000 34 connected 2-5463
3cb872f1153c39f9e1b5ce3f1ddf2136a2558bba 192.168.149.14:6379@16379 slave 7061695cb989db63cf08cecf5dcdcee7293a7c4e 0 1673519196088 34 connected
6d68da34389cb97e70e4ee8075108f6394847446 192.168.229.229:6379@16379 master - 0 1673519199100 29 connected 0-1 5464-10923
b138088411b13d8db3b855ca454d3f7881065d39 192.168.100.63:6379@16379 slave 7061695cb989db63cf08cecf5dcdcee7293a7c4e 0 1673519197089 34 connected
5e99ed9c02ae3daecabbef44d05563f502efc6eb 192.168.124.200:6379@16379 slave 6d68da34389cb97e70e4ee8075108f6394847446 0 1673519195000 29 connected
b2c8cdd82c52c27ecda5c44a121e21ebb5d2406a 192.168.41.246:6379@16379 slave 8035f1c9c0a7ade4a303381e492ff2b85dbe5f96 0 1673519198094 35 connected
```
We don't set redisClusterSetOptionMaxRetry() explicitly, so the default value should be used, I guess.
Also, I couldn't find any logs suggesting that we ran the cluster nodes command, and I suppose if it had been run, the issue would have resolved itself.
One more point to add: when hiredis calls back into the application during a TCP disconnect, the application doesn't do anything with the redisAsyncContext* pointer that is passed back (I mean no delete of the object or anything). I hope that is fine.
Detailed log set from the beginning. Log file top entries (beginning of the connection with Redis):
```
{"severity":"debug","debug-string":"[2023-01-12T07:38:41.584874][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:38:41.584888][connectCallback][114]Connected to 192.168.228.26:6379
"}
{"severity":"debug","debug-string":"[2023-01-12T07:45:45.904981][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:45:45.905107][connectCallback][114]Connected to 192.168.33.99:6379
"}
{"severity":"debug","debug-string":"[2023-01-12T07:47:30.263102][disconnectCallback][149]Error: Server closed the connection
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:24.185349][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:24.185751][connectCallback][114]Connected to 192.168.228.26:6379
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:26.186998][CommandCb][1480]Redis command timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:26.189608][disconnectCallback][149]Error: Timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:26.190408][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:26.190434][connectCallback][114]Connected to 192.168.228.26:6379
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:28.191117][disconnectCallback][149]Error: Timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:32.200001][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:32.200195][connectCallback][114]Connected to 192.168.228.26:6379
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:34.185965][CommandCb][1480] Redis command timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:34.186333][disconnectCallback][149]Error: Connection reset by peer
"}
{"severity":"debug","debug-string":"[2023-01-12T08:39:34.672284][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T08:39:34.672358][connectCallback][114]Connected to 192.168.228.26:6379
"}
{"severity":"debug","debug-string":"[2023-01-12T08:39:36.675527][disconnectCallback][149]Error: Timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T08:53:21.596753][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T08:53:21.596809][connectCallback][114]Connected to 192.168.228.26:6379
"}
{"severity":"debug","debug-string":"[2023-01-12T08:53:23.599153][disconnectCallback][149]Error: Timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T09:11:27.937986][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T09:11:27.938091][connectCallback][114]Connected to 192.168.228.26:6379
"}
```
Once the connection is established, application commands are triggered. We don't have prints showing when in this sequence they are triggered. There are application commands that are timing out in between (CommandCb). The other logs are all connect/disconnect related.
Hi @bjosv, do you have any thoughts on this?
One other thing I wanted to understand: while the client is connected to the redis-server and a slot redistribution happens, how will hiredis-cluster get to know of that? I guess the TCP connection won't be broken from the server side for a slot redistribution, right?
One other thing I saw while going through the code is that the max_retry_count check for cluster command calls is done in the following methods:
__redisClusterGetReplyFromNode()
redis_cluster_command_execute()
redisClusterAsyncRetryCallback()
But in the scenario mentioned in this ticket, I don't think we go through any of these methods. So it looks like max_retry_count never gets compared against, which could explain why we remain stuck in the same state.
I only see 2 CommandCb logs, which, if they are errors, would increase the internal failure counter.
The async legacy code expects 5 send errors (the default is 5) before it attempts to renew the slot distribution, so that might be the reason for the logs.
One remedy could be to use redisClusterSetOptionMaxRetry(1) to only allow one retry after a MOVED/ASK/error.
Normally the query from a client would get a MOVED response, which triggers the renewal of hiredis-cluster's slot distribution. In this case, in the async code, 5 send errors are allowed before the renewal. Each time the client tries to send the command, it first attempts to connect to the Redis node.
> I only see 2 CommandCb logs, which if they are errors would increase the internal failure counter. The async legacy code expects 5 send errors (default is 5) before it attempts to renew the slot distribution, so that might be a reason for the logs. One remedy could be to use redisClusterSetOptionMaxRetry(1) to only allow one retry after a MOVED/ASK/error. Normally the query from a client would get a MOVED response, which will trigger the renewal of hiredis-cluster's slot distribution. In this case in the async code 5 send errors are allowed before the renewal. Each time the client tries to send the command it first attempts to connect to the redis node.
Yeah, there are more than 5; I just pasted logs from the top of the logfile. If you see my initial post, it goes on for >2 hours. But one thing to note here is that it's not actually a response coming from the server that the application is logging; we have configured a command_timeout of 2 seconds, and since no response is seen within that period, a timeout happens, which is logged by the application with the CommandCb tag.
And as I mentioned above, the places where we check max_retry_count are mostly getting skipped, as we never go through one of those flows.
Edit: There is no response from the IP because that IP (192.168.228.26) is no longer valid. So obviously all the commands will eventually see a timeout.
Hi @bjosv, do you agree with the above observation?
If so, is there a way to introduce the max_retry_count check in the flow it's taking now?
> One other thing I saw while going through the code is that the max_retry_count check for cluster command calls is done in the following methods: __redisClusterGetReplyFromNode(), redis_cluster_command_execute(), redisClusterAsyncRetryCallback(). But in this scenario mentioned in the ticket, I don't think we are going through any of these methods. So it looks like max_retry_count never gets compared against which could explain why we are remaining/stuck in the same state.
Hi @bjosv, correcting my statement above: redisClusterAsyncRetryCallback() is the callback we pass to hiredis, so it should get called back for each command that is attempted.
So, whenever the hiredis API redisAsyncHandleTimeout() gets called on a command_timeout, hiredis sets the error to REDIS_ERR_TIMEOUT and calls back into hiredis-cluster with a NULL reply. As part of this, max_retry_count should have been incremented. So I tend to agree with your statement.
But the issue here is that if traffic is low and the key/slot never falls onto the problematic node, hiredis-cluster could just continue in the bad state for a very long time (as seen here: approx. 2 hours, 3 callback errors) with the default value of 5.
One more query in the same context: I was also wondering why a connect() call would succeed in such a scenario.
If the remote IP is not reachable, shouldn't the connect() itself fail? The fact that the application is able to print connectCallback-tagged messages indicates that the call succeeded with REDIS_OK. Do you have any thoughts on this, @bjosv?
> But the issue here is that, if traffic is less and the key/slot never falls into the issue node, redis-cluster could just continue with the bad state for such a long time (as seen here approx 2hrs, 3 callback errors) with the default value of 5.
Yes, I believe it was done this way to handle heavy-traffic scenarios, giving imperfect behavior in low-traffic scenarios..
> I was also wondering why a connect() call would succeed in such a scenario?
Since the async API is used, the connect will only register with the event system, which normally won't fail.
When the event system then polls an error from the socket, due to the failing connect, hiredis will/should call the connect callback with the argument status=REDIS_ERR.
Do you get a REDIS_OK in your case?
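To show that failure path concretely (a standalone sketch, not hiredis code): a non-blocking connect to a dead endpoint usually still "succeeds" with EINPROGRESS, and the real error only surfaces when the event loop polls the socket and reads SO_ERROR, which is when the connect callback would get REDIS_ERR. The bind-then-close trick below just finds a loopback port that is almost certainly closed.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* Find a port that is (almost certainly) closed: bind, note it, close. */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    int tmp = socket(AF_INET, SOCK_STREAM, 0);
    bind(tmp, (struct sockaddr *)&addr, sizeof(addr));
    socklen_t len = sizeof(addr);
    getsockname(tmp, (struct sockaddr *)&addr, &len);
    close(tmp);

    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(cfd, F_SETFL, fcntl(cfd, F_GETFL, 0) | O_NONBLOCK);

    int err = 0;
    int rc = connect(cfd, (struct sockaddr *)&addr, sizeof(addr));
    if (rc == -1 && errno == EINPROGRESS) {
        /* connect() looked fine so far; the error arrives via the event loop. */
        struct pollfd pfd = { .fd = cfd, .events = POLLOUT };
        poll(&pfd, 1, 1000);
        socklen_t elen = sizeof(err);
        getsockopt(cfd, SOL_SOCKET, SO_ERROR, &err, &elen);
    } else if (rc == -1) {
        err = errno; /* some systems fail a loopback connect immediately */
    }
    printf("socket error: %s\n",
           err == ECONNREFUSED ? "ECONNREFUSED" : strerror(err));
    close(cfd);
    return 0;
}
```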
Ah, yes, I see now that you got calls to the disconnect callback, which should only be received when there has been a successful connect. Probably from 192.168.228.26.
Do you have authentication enabled on Redis, using a password or username/password?
How about TLS?
Hi @bjosv

> Do you get a REDIS_OK in your case?

Yes, I keep getting REDIS_OK in the connectCallbacks.

> When the event system then polls an error from the socket, due to the failing connect, hiredis will/should call the connect callback with the argument status=REDIS_ERR.

Isn't connect() supposed to be a blocking call? I hope you are talking about redisAsyncConnectWithOptions(), which then calls redisContextConnectBindTcp(), where I could see that after the connect() call it checks the status of connect().

> Ah, yes, I see now that you got calls to the disconnect callback, which should only be received when there has been a successful connect. Probably from 192.168.228.26.

But the IP is no longer present in the cluster. I was thinking there is some issue here due to which it's wrongly claiming it's connecting.

> Do you have authentication enabled on Redis, using a password or username/password? How about TLS?

No password authentication or TLS is used for now.
Hmm. Yes, I guess that definitely looks like a possible gap.
However, while checking more of that flow, I see that once we come out of that else-if block, we also set the REDIS_CONNECTED flag:

```c
c->flags |= REDIS_CONNECTED;
```

And when I searched for places where hiredis could call back the onConnect() API with success, only one search result came up, inside __redisAsyncHandleConnect():
https://github.com/redis/hiredis/blob/7583ebb1b271e7cb8e44f8211d02bd719d4a4bd1/async.c#L672
redisCheckConnectDone() internally does a connect() on the socket, and only if the return code is 0 should it invoke line 627. --> Will connect() continue to be in non-blocking mode here?
Also, __redisAsyncHandleConnect() is called only if the flags don't have REDIS_CONNECTED set:

```c
if (!(c->flags & REDIS_CONNECTED)) {
    /* Abort connect was not successful. */
    if (__redisAsyncHandleConnect(ac) != REDIS_OK)
        return;
```

But if the errno == EINPROGRESS block had been hit, the flag would already have been set, and we would never invoke the onConnect() callback.
I'm now thinking: is __redisAsyncHandleConnect() at fault here?
Yeah, I suppose I missed noting this. :)
OK, still trying to understand where the REDIS_OK connectCallback would be called from.
I guess when there is a redisAsyncHandleWrite() or redisAsyncHandleRead() operation, it will invoke __redisAsyncHandleConnect(), which in turn calls connect(), and connect should return 0? This is the only place where it could invoke the connect callback with REDIS_OK, right?
One more thing I was thinking: since the IP does not exist, wouldn't the number of connectCallback prints translate to the number of commands attempted towards the node? I can see it's definitely more than 5.
There should be a connectCallback (connect attempt) for each CommandCb that timed out towards IP x.26.
Do you see more than 5 timed out CommandCb's?
Yes, in the logfile I am looking at, I can see that in this time interval the application Timeout error occurred 7 times.
Which operator are you using, btw?
KubeDB, AmadeusITGroup/Redis-Operator, .. or your own?
The root problem, that a timeout was not triggering a slotmap update, is corrected by #144.
The reason hiredis was able to connect to a cluster node that shouldn't be available looks like a Redis operator issue.
Please reopen with more info if that problem can be blamed on hiredis-cluster.