Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

redis-cluster attempting to connect to a stale redis server IP and not recovering,about nordix/hiredis-cluster

Comments (29)

bjosv commented on June 23, 2024 1

This is legacy behavior and a bit strange, and it would be nice to have it changed.
A disconnect is not always an indication of a change in the slot distribution, but if the slot distribution haven't been updated for a while maybe hiredis-cluster should perform an check/update in this case instead.

from hiredis-cluster.

bjosv commented on June 23, 2024 1

Isn't connect() supposed to be a blocking call? I hope you are talking about redisAsyncConnectWithOptions() which then calls redisContextConnectBindTcp() where I could see after connect() call its checking for status of connect().

I would guess the status check is skipped for async.
redisAsyncConnectWithOptions() sets REDIS_OPT_NONBLOCK which results in that REDIS_BLOCK is not set on the hiredis context.
It could be that the connect fail with errno=EINPROGRESS, which is not uncommon.

hiredis net.c

static int _redisContextConnectTcp(redisContext *c, const char *addr, int port,
...
    int blocking = (c->flags & REDIS_BLOCK);
...
        if (connect(s,p->ai_addr,p->ai_addrlen) == -1) {
            if (errno == EHOSTUNREACH) {
                redisNetClose(c);
                continue;
            } else if (errno == EINPROGRESS) {
                if (blocking) {
                    goto wait_for_ready;                <--- skipped
                }

I think this can be indicated and viewed in the hiredis-cluster async tests, I will double-check.

from hiredis-cluster.

bjosv commented on June 23, 2024 1

Just as a note, there is this in redisAsyncInitialize() called just after redisConnectWithOptions() in hiredis redisAsyncConnectWithOptions():
https://github.com/redis/hiredis/blob/6f5bae8c6900e051da6e677756508707565ce56e/async.c#L124-L127

from hiredis-cluster.

bjosv commented on June 23, 2024 1

Exactly what I think, __redisAsyncHandleConnect() should return REDIS_OK, and completed=1, that is the only way to set the REDIS_CONNECTED flag.
So, somehow it managed that.
Could it be that the K8s pod still exists with the redis instance, but its not included in the redis cluster?
Update: Ok, you had a printout of kubectl get pods without it.. yes, strange.

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024 1

Ya this flapping went on for atleast 2hrs. So very unlikely that could be the case.
Just thinking if there could be any chance connect() can return back a 0 for a non-success case.

from hiredis-cluster.

bjosv commented on June 23, 2024 1

Just one other thing that I thought of:
You ran the cluster nodes command manually to check the redis cluster, ie the log above with >>> Calling cluster nodes. Did you run this command on each node to check each nodes view of the cluster?

This would just making sure they are all in sync since hiredis-cluster can get its slot distribution from any master node, which might not be the one you send the command to.
Probably unlikely a problem though..but it depends on how your K8s Redis operator removes a node (if it send CLUSTER FORGET to all other nodes).

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024 1

Hi @bjosv

cluster nodes command is run as below on redis-cluster kubernetes service FQDN.

redis-cli --cluster call redis-cluster.local:6379 cluster nodes

It is not run for all nodes separately. OK, I see you think operator could also be involved in this. We are using IBM redis-operator for this.

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

Hi, can someone please take a look at this one? We are a bit blocked with this currently.

from hiredis-cluster.

bjosv commented on June 23, 2024

That sounds strange.
I guess you have checked the individual nodes by running the command CLUSTER NODES (or CLUSTER SLOTS) individually on them to make sure they are all in sync.
Do you set the max retry config to a specific value (using redisClusterSetOptionMaxRetry()) or is it default? (in async API that means 5 errors starts a timer to fetch topology).
Do you have/see any logs about failed command callbacks, which then would trigger a new topology fetch?

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

Hi @bjosv

Yes, cluster nodes command was separately run on the redis-server when issue happened and ouput(captured below) is matching with the kubectl o/p.

>>> Calling cluster nodes
redis-cluster.local:6379: 51c49c0ba2f1c33456dafc1654e4969c8ab25416 192.168.100.3:6379@16379 slave 6d68da34389cb97e70e4ee8075108f6394847446 0 1673519200113 29 connected
6fde1a7b2f6ed7355066534e8a1656f0f9a0e988 192.168.164.92:6379@16379 slave 8035f1c9c0a7ade4a303381e492ff2b85dbe5f96 0 1673519198598 35 connected
8035f1c9c0a7ade4a303381e492ff2b85dbe5f96 192.168.228.4:6379@16379 master - 0 1673519198000 35 connected 10924-16383
7061695cb989db63cf08cecf5dcdcee7293a7c4e 192.168.33.99:6379@16379 myself,master - 0 1673519195000 34 connected 2-5463
3cb872f1153c39f9e1b5ce3f1ddf2136a2558bba 192.168.149.14:6379@16379 slave 7061695cb989db63cf08cecf5dcdcee7293a7c4e 0 1673519196088 34 connected
6d68da34389cb97e70e4ee8075108f6394847446 192.168.229.229:6379@16379 master - 0 1673519199100 29 connected 0-1 5464-10923
b138088411b13d8db3b855ca454d3f7881065d39 192.168.100.63:6379@16379 slave 7061695cb989db63cf08cecf5dcdcee7293a7c4e 0 1673519197089 34 connected
5e99ed9c02ae3daecabbef44d05563f502efc6eb 192.168.124.200:6379@16379 slave 6d68da34389cb97e70e4ee8075108f6394847446 0 1673519195000 29 connected
b2c8cdd82c52c27ecda5c44a121e21ebb5d2406a 192.168.41.246:6379@16379 slave 8035f1c9c0a7ade4a303381e492ff2b85dbe5f96 0 1673519198094 35 connected

We dont set redisClusterSetOptionMaxRetry() explicitly. So default value should be set i guess.

Also, I couldn't find any logs which suggests that we ran cluster nodes command, and i suppose if it was run, it would have self resolved.

One more point to add is, when hiredis calls back into application during tcp disconnect, application doesnt do anything with the redisAsyncContext* pointer that is sent back (I mean no delete of object or anything). I hope that is fine.

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

detailed log set from begining:

Log file top entries

begining of connection with redis.

{"severity":"debug","debug-string":"[2023-01-12T07:38:41.584874][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:38:41.584888][connectCallback][114]Connected to 192.168.228.26:6379
"}

{"severity":"debug","debug-string":"[2023-01-12T07:45:45.904981][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:45:45.905107][connectCallback][114]Connected to 192.168.33.99:6379
"}

{"severity":"debug","debug-string":"[2023-01-12T07:47:30.263102][disconnectCallback][149]Error: Server closed the connection
"}

{"severity":"debug","debug-string":"[2023-01-12T07:48:24.185349][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:24.185751][connectCallback][114]Connected to 192.168.228.26:6379
"}

{"severity":"debug","debug-string":"[2023-01-12T07:48:26.186998][CommandCb][1480]Redis command timeout
"}

{"severity":"debug","debug-string":"[2023-01-12T07:48:26.189608][disconnectCallback][149]Error: Timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:26.190408][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:26.190434][connectCallback][114]Connected to 192.168.228.26:6379

{"severity":"debug","debug-string":"[2023-01-12T07:48:28.191117][disconnectCallback][149]Error: Timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:32.200001][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:32.200195][connectCallback][114]Connected to 192.168.228.26:6379
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:34.185965][CommandCb][1480] Redis command timeout
"}

{"severity":"debug","debug-string":"[2023-01-12T07:48:34.186333][[disconnectCallback][149]Error: Connection reset by peer
"}
{"severity":"debug","debug-string":"[2023-01-12T08:39:34.672284][[connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T08:39:34.672358][[connectCallback][114]Connected to 192.168.228.26:6379
"}

{"severity":"debug","debug-string":"[2023-01-12T08:39:36.675527][disconnectCallback][149]Error: Timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T08:53:21.596753][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T08:53:21.596809][connectCallback][114]Connected to 192.168.228.26:6379
"}

{"severity":"debug","debug-string":"[2023-01-12T08:53:23.599153][disconnectCallback][149]Error: Timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T09:11:27.937986][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T09:11:27.938091][connectCallback][114]Connected to 192.168.228.26:6379
"}

Once connection is established, there are application commands that are triggered. We dont have prints as to when in this sequence they are triggered. There are application commands that are timing out in bwn (CommandCb) Other logs are all connect/disconnect related.

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

Hi @bjosv, You have any thoughts on this?
One other thing which i wanted to understand is that while client is connected to redis-server, and there is a slot redistribution, how will redis-cluster get to know of that? I guess tcp connection won't be broken from server side for slot redistribution, right?

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

One other thing I saw while going through the code is that max_retry_count is checked for making cluster commands call is done in following methoods.

__redisClusterGetReplyFromNode()
redis_cluster_command_execute()
redisClusterAsyncRetryCallback()

But in this scenario mentioned in the ticket, I don't think we are going through any of these methods. So it looks like max_retry_count never gets compared against which could explain why we are remaining/stuck in the same state.

from hiredis-cluster.

bjosv commented on June 23, 2024

I only see 2 CommandCb logs, which if they are errors would increases the internal failure counter.
The async legacy code expects 5 send errors (default is 5) before it attempts to renew the slot distribution, so that might be a reason for the logs.
One remedy could be to use redisClusterSetOptionMaxRetry(1) to only allow one retry after a MOVED/ASK/error.

Normally the query from a client would get MOVED response, which will trigger the renewal of hiredis-clusters slots distribution. In this case in the async code 5 send errors are allowed before the renewal. Each time the client tries to send the command it first attempts to connect to the redis node.

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

I only see 2 CommandCb logs, which if they are errors would increases the internal failure counter. The async legacy code expects 5 send errors (default is 5) before it attempts to renew the slot distribution, so that might be a reason for the logs. One remedy could be to use redisClusterSetOptionMaxRetry(1) to only allow one retry after a MOVED/ASK/error.

Normally the query from a client would get MOVED response, which will trigger the renewal of hiredis-clusters slots distribution. In this case in the async code 5 send errors are allowed before the renewal. Each time the client tries to send the command it first attempts to connect to the redis node.

Ya there are more than 5. I just pasted logs from top of the logfile. If you see my initial post, it goes on for >2hrs. But one thing to note here is that its not actually a response coming from the server which the application is logging, we have configured for command_timeout of 2sec, and since no response is seen for that period, there is a timeout happening which is getting logged by the application with CommandCb tag.

And as i mentioned above, the places where we check for max_retry_count is mostly getting skipped as we never go through one of those flows.

Edit: There is no response from the IP because that IP(192.168.228.26) is no longer valid. So obviously all the commands will see a timeout eventually.

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

Hi @bjosv do you agree with the above observation?
Is so, is there a way to introduce max_retry_count check in the flow it's taking now?

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

One other thing I saw while going through the code is that max_retry_count is checked for making cluster commands call is done in following methoods.
__redisClusterGetReplyFromNode()
redis_cluster_command_execute()
redisClusterAsyncRetryCallback()
But in this scenario mentioned in the ticket, I don't think we are going through any of these methods. So it looks like max_retry_count never gets compared against which could explain why we are remaining/stuck in the same state.

Hi @bjosv Correcting my statement above:

redisClusterAsyncRetryCallback() is the callback we pass to hiredis, so it should get called back for each command that is attempted.
So, whenever hiredis api redisAsyncHandleTimeout() gets called during command_timeout, hiredis sets the error to REDIS_ERR_TIMEOUT and callsback into hiredis-cluster with NULL reply. As part of this max_retry_count should have got incremented. So, I tend to agree with your statement.

But the issue here is that, if traffic is less and the key/slot never falls into the issue node, redis-cluster could just continue with the bad state for such a long time (as seen here approx 2hrs, 3 callback errors) with the default value of 5.

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

One more query in the same context. I was also wondering why a connect() call would succeed in such a scenario?

If the remote ip is not reachable shouldn't the connect() itself fail? The fact that application is able to print connectCallback tagged messages is an indication that the call succeeded with REDIS_OK. Do you have any thoughts on this @bjosv?

from hiredis-cluster.

bjosv commented on June 23, 2024

But the issue here is that, if traffic is less and the key/slot never falls into the issue node, redis-cluster could just continue with the bad state for such a long time (as seen here approx 2hrs, 3 callback errors) with the default value of 5.

Yes, I believe it was done in this way to handle heavy traffic scenarios, giving an imperfect behavior for low traffic scenarios..

I was also wondering why a connect() call would succeed in such a scenario?

Since the async api is used the connect will only register to the eventsystem, which normally wont fail.
When the eventsystem then polls an error from the socket, due to the failing connect, hiredis will/should call the connect callback with the argument status=REDIS_ERR.
Do you get a REDIS_OK in your case?

from hiredis-cluster.

bjosv commented on June 23, 2024

Ah, yes, I see now that you got calls to the disconnect callback, which only should be received when there has been a successful connect. Probably from 192.168.228.26.

Do you have authentication enabled on Redis, using password or username/password?
How about TLS?

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

Hi @bjosv

Do you get a REDIS_OK in your case?

Yes, I keep getting REDIS_OK in the connectCallbacks.

When the eventsystem then polls an error from the socket, due to the failing connect, hiredis will/should call the connect callback with the argument status=REDIS_ERR.

Isn't connect() supposed to be a blocking call? I hope you are talking about redisAsyncConnectWithOptions() which then calls redisContextConnectBindTcp() where I could see after connect() call its checking for status of connect().

Ah, yes, I see now that you got calls to the disconnect callback, which only should be received when there has been a successful connect. Probably from 192.168.228.26.

But the IP is no longer present in the cluster. I was thinking there is some issue here due to which its wrongly claiming it's connecting.

Do you have authentication enabled on Redis, using password or username/password?
How about TLS?

There is no password authentication/TLS used for now.

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

Hmm. Yes, I guess that definitely looks like a possible gap.

However I was checking more in that flow, I see that once we come out of that else if block, we also set the flag to REDIS_CONNECTED.

c->flags |= REDIS_CONNECTED;

And when i searched for places where hiredis could callback onConnect() api with success only 1 search result came up.
which is inside __redisAsyncHandleConnect():

https://github.com/redis/hiredis/blob/7583ebb1b271e7cb8e44f8211d02bd719d4a4bd1/async.c#L672

redisCheckConnectDone() internally does a connect() on the socket and if return code is 0 then only it should invoke the line 627. --> Will connect() continue to be non-blocking mode here ?

Also, __redisAsyncHandleConnect() is called only if flags don't have REDIS_CONNECTED set.

653     if (!(c->flags & REDIS_CONNECTED)) {
654         /* Abort connect was not successful. */
655         if (__redisAsyncHandleConnect(ac) != REDIS_OK)
656             return;

But if errno == EINPROGRESS block had got hit, the flag would have been already set, and we would never invoke the onConnect() callback.

I'm thinking now is __redisAsyncHandleConnect() at fault here ?

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

Ya, I suppose I missed noting this. :)

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

Ok, still trying to understand where would be the REDIS_OK connectedCallback be called from.

I guess when there is a redisAsyncHandleWrite() or a redisAsyncHandleRead() operation, it will invoke __redisAsyncHandleConnect() which inturn calls connect() and connect should returns back 0 ? This is the only place where it could invoke connected callback with REDIS_OK right ?

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

One more thing i was thinking is, since the IP is not existing, wouldn't number of connectCallback prints translate to number of commands attempted towards the node ? I could see its definitely more than 5.

from hiredis-cluster.

bjosv commented on June 23, 2024

There should be a connectCallback (connect attempt) for each CommandCb that timed out towards IP x.26.
Do you see more than 5 timed out CommandCb's?

from hiredis-cluster.

rahulKrishnaM commented on June 23, 2024

Yes, in the logfile i am looking at i could see in this time interval application Timeout error has occurred for 7 times.

from hiredis-cluster.

bjosv commented on June 23, 2024

Which operator are you using btw?
KubeDB, AmadeusITGroup/Redis-Operator/.. or own?

from hiredis-cluster.

bjosv commented on June 23, 2024

The root problem that a timeout was not triggering a slotmap update is corrected by #144.
The reason for hiredis to be able to connect to a cluster-node that shouldn't be available looks like a Redis operator issue.
Please reopen with more info if that problem can be blamed on hiredis-cluster.

from hiredis-cluster.

redis-cluster attempting to connect to a stale redis server IP and not recovering about hiredis-cluster HOT 29 CLOSED

Comments (29)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent