[Improvement] Race condition for being selectable when some nodes in the cluster are down #2520

Matiszak · 2023-08-06T20:11:46Z

Context:
We have a cluster consisting of 3 masters + 3 replicas (1 replica for each master). We are simulating situation when some of our cluster is down (in this example we shut down 3 replicas). Our .NET clients have all of the 6 nodes put into configuration. Also option "resolveDns=true" is set. (resolveDns is not crucial here, it's just because nodes know themselves through their ips so that I set it to have not duplicated connections by hostname and by ip after discovery).

When some of the nodes in the cluster are down they are not responsive. So they exhaust default connection timeout of 5000 milliseconds. They exhaust even much higher connection timeouts (tens of seconds). It was noticed in previous versions of StackExchange.Redis (<2.2.xx) that this behavior caused problems because after .Connect() randomly some of the nodes were unselectable by reason DidNotRespond. Upgrading to 2.6.xx mostly solved this issue (for which im glad) due to queueing feature. But I noticed that the underlying issue is still there and in some scenarios might cause problems so that's the reason I'm posting this. I found one use case in which it manifests but you might be aware of more.

Issue:

The first iteration of connecting to nodes takes 5 seconds and connects to 3 healthy nodes but fails to connect to remaining 3 wrong ones.
Second iteration issues PINGs to healthy nodes because of sendTracerIfConnected: true in server.OnConnectedAsync(log, sendTracerIfConnected: true, autoConfigureIfConnected: reconfigureAll);
Since first iteration exhaused timeout the second iteration exits immediately without waiting
For a brief moment after .Connect() some of the healthy nodes are undelectable even though connection succeeded in the first iteration
It causes issues in some scenarios (e.g SET with BacklogPolicy = FailFast

Reproduction code:

var options = ConfigurationOptions.Parse("server1:6116,server2:6116,server3:6116,server4:6116,server5:6116,server6:6116,password=some_password,resolveDns=true");
options.BacklogPolicy = BacklogPolicy.FailFast;
var connection = ConnectionMultiplexer.Connect(options, new LoggerWriter());
var status = connection.GetStatus();
connection.GetDatabase(0).StringSet("test2", "test2");

Exception:

No connection (requires writable - not eligible for replica) is active/available to service this operation: SET test2; It was not possible to connect to the redis server(s). Error connecting right now. To allow this multiplexer to continue retrying until it's able to connect, use abortConnect=false in your connection string or AbortOnConnectFail=false; in your code. ConnectTimeout, mc: 1/1/0, mgr: 10 of 10 available, clientName: MACHINENAME(SE.Redis-v2.6.122.38350), PerfCounterHelperkeyHashSlot: 8899, IOCP: (Busy=0,Free=1000,Min=16,Max=1000), WORKER: (Busy=0,Free=32767,Min=16,Max=32767), POOL: (Threads=23,QueuedItems=0,CompletedItems=526,Timers=17), v: 2.6.122.38350

Log:

20:04:14.2643: Connecting (sync) on .NET 6.0.19 (StackExchange.Redis: v2.6.122.38350)
20:04:14.2770: server01.redacted-domain.com:6116,server02.redacted-domain.com:6116,server03.redacted-domain.com:6116,server04.redacted-domain.com:6116,server05.redacted-domain.com:6116,server06.redacted-domain.com:6116,password=*****,resolveDns=True
20:04:14.2782: Using DNS to resolve 'server01.redacted-domain.com'...
20:04:14.2801: 'server01.redacted-domain.com' => 10.0.0.36
20:04:14.2806: Using DNS to resolve 'server02.redacted-domain.com'...
20:04:14.2863: 'server02.redacted-domain.com' => 10.0.0.37
20:04:14.2868: Using DNS to resolve 'server03.redacted-domain.com'...
20:04:14.2874: 'server03.redacted-domain.com' => 10.0.0.38
20:04:14.2882: Using DNS to resolve 'server04.redacted-domain.com'...
20:04:14.2888: 'server04.redacted-domain.com' => 10.0.0.39
20:04:14.2889: Using DNS to resolve 'server05.redacted-domain.com'...
20:04:14.2895: 'server05.redacted-domain.com' => 10.0.0.40
20:04:14.2897: Using DNS to resolve 'server06.redacted-domain.com'...
20:04:14.2903: 'server06.redacted-domain.com' => 10.0.0.41
20:04:14.2906: 10.0.0.36:6116/Interactive: Connecting...
20:04:14.2911: 10.0.0.36:6116: BeginConnectAsync
20:04:14.2921: 10.0.0.37:6116/Interactive: Connecting...
20:04:14.2925: 10.0.0.37:6116: BeginConnectAsync
20:04:14.2933: 10.0.0.38:6116/Interactive: Connecting...
20:04:14.2936: 10.0.0.38:6116: BeginConnectAsync
20:04:14.2944: 10.0.0.39:6116/Interactive: Connecting...
20:04:14.2947: 10.0.0.39:6116: BeginConnectAsync
20:04:14.2955: 10.0.0.40:6116/Interactive: Connecting...
20:04:14.2958: 10.0.0.40:6116: BeginConnectAsync
20:04:14.2966: 10.0.0.41:6116/Interactive: Connecting...
20:04:14.2969: 10.0.0.41:6116: BeginConnectAsync
20:04:14.2977: 6 unique nodes specified (with tiebreaker)
20:04:14.2983: 10.0.0.36:6116: OnConnectedAsync init (State=Connecting)
20:04:14.2984: 10.0.0.37:6116: OnConnectedAsync init (State=Connecting)
20:04:14.2987: 10.0.0.38:6116: OnConnectedAsync init (State=Connecting)
20:04:14.2989: 10.0.0.39:6116: OnConnectedAsync init (State=Connecting)
20:04:14.2991: 10.0.0.40:6116: OnConnectedAsync init (State=Connecting)
20:04:14.3012: 10.0.0.41:6116: OnConnectedAsync init (State=Connecting)
20:04:14.3019: Allowing 6 endpoint(s) 00:00:05 to respond...
20:04:14.3033: Awaiting 6 available task completion(s) for 5000ms, IOCP: (Busy=0,Free=1000,Min=16,Max=1000), WORKER: (Busy=1,Free=32766,Min=16,Max=32767), POOL: (Threads=19,QueuedItems=0,CompletedItems=356,Timers=25)
20:04:14.3187: 10.0.0.41:6116/Interactive: Connected
20:04:14.3199: 10.0.0.41:6116: Server handshake
20:04:14.3227: 10.0.0.39:6116/Interactive: Connected
20:04:14.3233: 10.0.0.41:6116: Authenticating (password)
20:04:14.3260: 10.0.0.39:6116: Server handshake
20:04:14.3265: 10.0.0.39:6116: Authenticating (password)
20:04:14.3271: 10.0.0.41:6116: Setting client name: MACHINENAME(SE.Redis-v2.6.122.38350)
20:04:14.3276: 10.0.0.39:6116: Setting client name: MACHINENAME(SE.Redis-v2.6.122.38350)
20:04:14.3278: 10.0.0.40:6116/Interactive: Connected
20:04:14.3281: 10.0.0.40:6116: Server handshake
20:04:14.3283: 10.0.0.40:6116: Authenticating (password)
20:04:14.3285: 10.0.0.41:6116: Setting client lib/ver
20:04:14.3288: 10.0.0.39:6116: Setting client lib/ver
20:04:14.3290: 10.0.0.40:6116: Setting client name: MACHINENAME(SE.Redis-v2.6.122.38350)
20:04:14.3293: 10.0.0.41:6116: Auto-configuring...
20:04:14.3295: 10.0.0.39:6116: Auto-configuring...
20:04:14.3297: 10.0.0.40:6116: Setting client lib/ver
20:04:14.3300: 10.0.0.41:6116: Requesting tie-break (Key="__Booksleeve_TieBreak")...
20:04:14.3302: 10.0.0.39:6116: Requesting tie-break (Key="__Booksleeve_TieBreak")...
20:04:14.3305: 10.0.0.40:6116: Auto-configuring...
20:04:14.3308: 10.0.0.41:6116/Interactive: Writing: GET __Booksleeve_TieBreak
20:04:14.3310: 10.0.0.39:6116/Interactive: Writing: GET __Booksleeve_TieBreak
20:04:14.3312: 10.0.0.40:6116: Requesting tie-break (Key="__Booksleeve_TieBreak")...
20:04:14.3334: 10.0.0.41:6116: Sending critical tracer (handshake): ECHO
20:04:14.3356: 10.0.0.39:6116: Sending critical tracer (handshake): ECHO
20:04:14.3359: 10.0.0.39:6116/Interactive: Writing: ECHO
20:04:14.3362: 10.0.0.41:6116/Interactive: Writing: ECHO
20:04:14.3365: 10.0.0.41:6116: Flushing outbound buffer
20:04:14.3367: 10.0.0.39:6116: Flushing outbound buffer
20:04:14.3370: 10.0.0.39:6116: OnEstablishingAsync complete
20:04:14.3373: 10.0.0.39:6116: Starting read
20:04:14.3377: 10.0.0.40:6116/Interactive: Writing: GET __Booksleeve_TieBreak
20:04:14.3379: 10.0.0.41:6116: OnEstablishingAsync complete
20:04:14.3382: 10.0.0.40:6116: Sending critical tracer (handshake): ECHO
20:04:14.3385: 10.0.0.41:6116: Starting read
20:04:14.3388: 10.0.0.40:6116/Interactive: Writing: ECHO
20:04:14.3391: 10.0.0.40:6116: Flushing outbound buffer
20:04:14.3394: 10.0.0.40:6116: OnEstablishingAsync complete
20:04:14.3398: 10.0.0.40:6116: Starting read
20:04:14.3476: 10.0.0.41:6116: Auto-configured (CONFIG) read-only replica: true
20:04:14.3486: 10.0.0.39:6116: Auto-configured (CONFIG) read-only replica: true
20:04:14.3487: 10.0.0.40:6116: Auto-configured (CONFIG) read-only replica: true
20:04:14.3512: 10.0.0.41:6116: Auto-configured (CONFIG) databases: 16
20:04:14.3516: 10.0.0.39:6116: Auto-configured (CONFIG) databases: 16
20:04:14.3519: 10.0.0.40:6116: Auto-configured (CONFIG) databases: 16
20:04:14.3524: 10.0.0.41:6116: Auto-configured (INFO) role: primary
20:04:14.3532: 10.0.0.39:6116: Auto-configured (INFO) role: primary
20:04:14.3536: 10.0.0.40:6116: Auto-configured (INFO) role: primary
20:04:14.3540: 10.0.0.41:6116: Auto-configured (INFO) version: 5.0.8
20:04:14.3545: 10.0.0.39:6116: Auto-configured (INFO) version: 5.0.8
20:04:14.3548: 10.0.0.40:6116: Auto-configured (INFO) version: 5.0.8
20:04:14.3552: 10.0.0.41:6116: Auto-configured (INFO) server-type: cluster
20:04:14.3556: 10.0.0.39:6116: Auto-configured (INFO) server-type: cluster
20:04:14.3558: 10.0.0.40:6116: Auto-configured (INFO) server-type: cluster
20:04:14.3565: Response from 10.0.0.40:6116/Interactive / GET __Booksleeve_TieBreak: (null)
20:04:14.3573: Response from 10.0.0.40:6116/Interactive / ECHO: BulkString: 16 bytes
20:04:14.3575: Response from 10.0.0.41:6116/Interactive / GET __Booksleeve_TieBreak: Error: MOVED 6202 10.0.0.40:6116
20:04:14.3579: Response from 10.0.0.39:6116/Interactive / GET __Booksleeve_TieBreak: Error: MOVED 6202 10.0.0.40:6116
20:04:14.3582: 10.0.0.40:6116: OnConnectedAsync completed (From command: ECHO)
20:04:14.3584: Response from 10.0.0.41:6116/Interactive / ECHO: BulkString: 16 bytes
20:04:14.3587: Response from 10.0.0.39:6116/Interactive / ECHO: BulkString: 16 bytes
20:04:14.3590: 10.0.0.41:6116: OnConnectedAsync completed (From command: ECHO)
20:04:14.3593: 10.0.0.39:6116: OnConnectedAsync completed (From command: ECHO)
20:04:19.3038: Not all available tasks completed cleanly (from ReconfigureAsync#1489, timeout 5000ms), IOCP: (Busy=0,Free=1000,Min=16,Max=1000), WORKER: (Busy=9,Free=32758,Min=16,Max=32767), POOL: (Threads=23,QueuedItems=0,CompletedItems=443,Timers=11)
20:04:19.3682: 10.0.0.36:6116: OnConnectedAsync completed (Disconnected)
20:04:19.3723: 10.0.0.38:6116: OnConnectedAsync completed (Disconnected)
20:04:19.3727: 10.0.0.37:6116: OnConnectedAsync completed (Disconnected)
20:04:19.3734:   Server[0] (10.0.0.36:6116) Status: WaitingForActivation (inst: 0, qs: 0, in: 0, qu: 0, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: NotStarted. ws: Initializing)
20:04:19.3801:   Server[1] (10.0.0.37:6116) Status: RanToCompletion (inst: 0, qs: 0, in: 0, qu: 0, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: NotStarted. ws: Initializing)
20:04:19.3870:   Server[2] (10.0.0.38:6116) Status: RanToCompletion (inst: 0, qs: 0, in: 0, qu: 0, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: NotStarted. ws: Initializing)
20:04:19.3924:   Server[3] (10.0.0.39:6116) Status: RanToCompletion (inst: 13, qs: 0, in: 0, qu: 13, aw: False, in-pipe: 0, out-pipe: 0, bw: Inactive, rs: ReadAsync. ws: Idle)
20:04:19.4106:   Server[4] (10.0.0.40:6116) Status: RanToCompletion (inst: 13, qs: 0, in: 0, qu: 13, aw: False, in-pipe: 0, out-pipe: 0, bw: Inactive, rs: ReadAsync. ws: Idle)
20:04:19.4208:   Server[5] (10.0.0.41:6116) Status: RanToCompletion (inst: 13, qs: 0, in: 0, qu: 13, aw: False, in-pipe: 0, out-pipe: 0, bw: Inactive, rs: ReadAsync. ws: Idle)
20:04:19.4278: Endpoint summary:
20:04:19.4289:   10.0.0.36:6116: Endpoint is (Interactive: Connecting, Subscription: Connecting)
20:04:19.4291:   10.0.0.37:6116: Endpoint is (Interactive: Connecting, Subscription: Connecting)
20:04:19.4294:   10.0.0.38:6116: Endpoint is (Interactive: Connecting, Subscription: Connecting)
20:04:19.4310: Connection failed: 10.0.0.37:6116 (Subscription, UnableToConnect): UnableToConnect on 10.0.0.37:6116/Subscription, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 5s ago, last-write: 5s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 0s ago, v: 2.6.122.38350
20:04:19.4317: Connection failed: 10.0.0.36:6116 (Interactive, UnableToConnect): UnableToConnect on 10.0.0.36:6116/Interactive, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 5s ago, last-write: 5s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 0s ago, v: 2.6.122.38350
20:04:19.4329: Connection failed: 10.0.0.37:6116 (Interactive, UnableToConnect): UnableToConnect on 10.0.0.37:6116/Interactive, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 5s ago, last-write: 5s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 0s ago, v: 2.6.122.38350
20:04:19.4331: Connection failed: 10.0.0.36:6116 (Subscription, UnableToConnect): UnableToConnect on 10.0.0.36:6116/Subscription, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 5s ago, last-write: 5s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 0s ago, v: 2.6.122.38350
20:04:19.4332: Connection failed: 10.0.0.38:6116 (Subscription, UnableToConnect): UnableToConnect on 10.0.0.38:6116/Subscription, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 5s ago, last-write: 5s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 0s ago, v: 2.6.122.38350
20:04:19.4334: Connection failed: 10.0.0.38:6116 (Interactive, UnableToConnect): UnableToConnect on 10.0.0.38:6116/Interactive, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 5s ago, last-write: 5s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 0s ago, v: 2.6.122.38350
20:04:19.4341:   10.0.0.39:6116: Endpoint is (Interactive: ConnectedEstablished, Subscription: ConnectedEstablished)
20:04:19.4343:   10.0.0.40:6116: Endpoint is (Interactive: ConnectedEstablished, Subscription: ConnectedEstablished)
20:04:19.4346:   10.0.0.41:6116: Endpoint is (Interactive: ConnectedEstablished, Subscription: ConnectedEstablished)
20:04:19.4348: Task summary:
20:04:19.4350:   10.0.0.36:6116: Returned, but incorrectly
20:04:19.4353:   10.0.0.37:6116: Returned, but incorrectly
20:04:19.4355:   10.0.0.38:6116: Returned, but incorrectly
20:04:19.4360:   10.0.0.39:6116: Returned with success as Cluster primary (Source: From command: ECHO)
20:04:19.4640:   10.0.0.40:6116: Returned with success as Cluster primary (Source: From command: ECHO)
20:04:19.4649:   10.0.0.41:6116: Returned with success as Cluster primary (Source: From command: ECHO)
20:04:19.4671: 10.0.0.39:6116: OnConnectedAsync already connected start
20:04:19.4674: 10.0.0.39:6116/Interactive: Writing: PING
20:04:19.4679: 10.0.0.41:6116: OnConnectedAsync already connected start
20:04:19.4682: 10.0.0.41:6116/Interactive: Writing: PING
20:04:19.4689: 10.0.0.40:6116: OnConnectedAsync already connected start
20:04:19.4692: 10.0.0.40:6116/Interactive: Writing: PING
20:04:19.4697: Allowing 3 endpoint(s) -00:00:00.1680000 to respond...
20:04:19.4701: Awaiting 3 available task completion(s) for -168ms, IOCP: (Busy=0,Free=1000,Min=16,Max=1000), WORKER: (Busy=1,Free=32766,Min=16,Max=32767), POOL: (Threads=23,QueuedItems=0,CompletedItems=472,Timers=17)
20:04:19.4706: Timeout before awaiting for tasks, IOCP: (Busy=0,Free=1000,Min=16,Max=1000), WORKER: (Busy=1,Free=32766,Min=16,Max=32767), POOL: (Threads=23,QueuedItems=0,CompletedItems=472,Timers=17)
20:04:19.4710:   Server[0] (10.0.0.39:6116) Status: WaitingForActivation (inst: 15, qs: 1, in: 0, qu: 15, aw: False, in-pipe: 0, out-pipe: 0, bw: Inactive, rs: ReadAsync. ws: Idle)
20:04:19.4714:   Server[1] (10.0.0.41:6116) Status: WaitingForActivation (inst: 14, qs: 1, in: 0, qu: 14, aw: False, in-pipe: 0, out-pipe: 0, bw: Inactive, rs: ReadAsync. ws: Idle)
20:04:19.4718:   Server[2] (10.0.0.40:6116) Status: WaitingForActivation (inst: 14, qs: 1, in: 0, qu: 14, aw: False, in-pipe: 0, out-pipe: 0, bw: Inactive, rs: ReadAsync. ws: Idle)
20:04:19.4722: Endpoint summary:
20:04:19.4725:   10.0.0.39:6116: Endpoint is (Interactive: ConnectedEstablished, Subscription: ConnectedEstablished)
20:04:19.4729:   10.0.0.41:6116: Endpoint is (Interactive: ConnectedEstablished, Subscription: ConnectedEstablished)
20:04:19.4733:   10.0.0.40:6116: Endpoint is (Interactive: ConnectedEstablished, Subscription: ConnectedEstablished)
20:04:19.4737: Task summary:
20:04:19.4741:   10.0.0.39:6116: Did not respond (Task.Status: WaitingForActivation)
20:04:19.4745:   10.0.0.41:6116: Did not respond (Task.Status: WaitingForActivation)
20:04:19.4751:   10.0.0.40:6116: Did not respond (Task.Status: WaitingForActivation)
20:04:19.4756: Cluster: 16384 of 16384 slots covered
20:04:19.4764: Endpoint Summary:
20:04:19.4776:   10.0.0.36:6116: Standalone v3.0.0, primary; keep-alive: 00:01:00; int: Connecting; sub: Connecting; not in use: DidNotRespond
20:04:19.4796:   10.0.0.36:6116: int ops=0, qu=0, qs=0, qc=0, wr=0, socks=2; sub ops=0, qu=0, qs=0, qc=0, wr=0, socks=2
20:04:19.4856:   10.0.0.36:6116: Circular op-count snapshot; int: 0 (0,00 ops/s; spans 10s); sub: 0 (0,00 ops/s; spans 10s)
20:04:19.4858:   10.0.0.37:6116: Standalone v3.0.0, primary; keep-alive: 00:01:00; int: Connecting; sub: Connecting; not in use: DidNotRespond
20:04:19.4861:   10.0.0.37:6116: int ops=0, qu=0, qs=0, qc=0, wr=0, socks=2; sub ops=0, qu=0, qs=0, qc=0, wr=0, socks=2
20:04:19.4864:   10.0.0.37:6116: Circular op-count snapshot; int: 0 (0,00 ops/s; spans 10s); sub: 0 (0,00 ops/s; spans 10s)
20:04:19.4866:   10.0.0.38:6116: Standalone v3.0.0, primary; keep-alive: 00:01:00; int: Connecting; sub: Connecting; not in use: DidNotRespond
20:04:19.4869:   10.0.0.38:6116: int ops=0, qu=0, qs=0, qc=0, wr=0, socks=2; sub ops=0, qu=0, qs=0, qc=0, wr=0, socks=2
20:04:19.4871:   10.0.0.38:6116: Circular op-count snapshot; int: 0 (0,00 ops/s; spans 10s); sub: 0 (0,00 ops/s; spans 10s)
20:04:19.4875:   10.0.0.39:6116: Cluster v5.0.8, primary; 16 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active; not in use: DidNotRespond
20:04:19.4876:   10.0.0.39:6116: int ops=15, qu=0, qs=1, qc=0, wr=0, socks=1; sub ops=6, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
20:04:19.4879:   10.0.0.39:6116: Circular op-count snapshot; int: 0+15=15 (1,50 ops/s; spans 10s); sub: 0+6=6 (0,60 ops/s; spans 10s)
20:04:19.4881:   10.0.0.40:6116: Cluster v5.0.8, primary; 16 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active; not in use: DidNotRespond
20:04:19.4883:   10.0.0.40:6116: int ops=14, qu=0, qs=1, qc=0, wr=0, socks=1; sub ops=6, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
20:04:19.4886:   10.0.0.40:6116: Circular op-count snapshot; int: 0+14=14 (1,40 ops/s; spans 10s); sub: 0+6=6 (0,60 ops/s; spans 10s)
20:04:19.4889: Response from 10.0.0.39:6116/Interactive / PING: SimpleString: PONG
20:04:19.4891: Response from 10.0.0.40:6116/Interactive / PING: SimpleString: PONG
20:04:19.4894: Response from 10.0.0.41:6116/Interactive / PING: SimpleString: PONG
20:04:19.4896:   10.0.0.41:6116: Cluster v5.0.8, primary; 16 databases; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active; not in use: DidNotRespond
20:04:19.4899: 10.0.0.39:6116: OnConnectedAsync already connected end
20:04:19.4902: 10.0.0.40:6116: OnConnectedAsync already connected end
20:04:19.4904: 10.0.0.41:6116: OnConnectedAsync already connected end
20:04:19.4906:   10.0.0.41:6116: int ops=14, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=6, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
20:04:19.4909:   10.0.0.41:6116: Circular op-count snapshot; int: 0+14=14 (1,40 ops/s; spans 10s); sub: 0+6=6 (0,60 ops/s; spans 10s)
20:04:19.4911: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
20:04:19.4914: Starting heartbeat...
20:04:19.4919: Total connect time: 5 227 ms

The text was updated successfully, but these errors were encountered:

…scovery (StackExchange#2520) Sending tracers is not necessary because we just connected to the nodes a few seconds ago. It causes problems because sent tracers do not have enough time to respond.

Matiszak · 2023-08-17T14:09:58Z

I made a PR #2525 which seem to fix the problem perfectly in my case.

Additionally I found two more scenarios that manifest this issue:

when commands are executed in NoRedirect mode (as I learned it's by default done in Hangfire.Pro.Redis package)
when commands are executed in transaction-/ in MULTI. Then error "MOVED ..." is hidden behind "StackExchange.Redis.RedisServerException: EXECABORT Transaction discarded because of previous errors." so I assume StackExchange.Redis is not aware of it and can't requeue it.

Matiszak changed the title ~~[Improvement] Race condition for being selectable when some nodes in the cluster is down~~ [Improvement] Race condition for being selectable when some nodes in the cluster are down Aug 6, 2023

Matiszak mentioned this issue Aug 16, 2023

Do not send tracers when running second iteration of cluster nodes di… #2525

Merged

NickCraver linked a pull request Sep 10, 2023 that will close this issue

Do not send tracers when running second iteration of cluster nodes di… #2525

Merged

NickCraver closed this as completed in #2525 Sep 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Improvement] Race condition for being selectable when some nodes in the cluster are down #2520

[Improvement] Race condition for being selectable when some nodes in the cluster are down #2520

Matiszak commented Aug 6, 2023

Matiszak commented Aug 17, 2023

[Improvement] Race condition for being selectable when some nodes in the cluster are down #2520

[Improvement] Race condition for being selectable when some nodes in the cluster are down #2520

Comments

Matiszak commented Aug 6, 2023

Matiszak commented Aug 17, 2023