Detecting cluster failover by application #13359

dev4342345235 · 2025-02-18T09:40:06Z

dev4342345235
Feb 18, 2025

Community Support Policy

I have read RabbitMQ's Community Support Policy
I run RabbitMQ 4.x, the only series currently covered by community support
I promise to provide all relevant information (versions, logs from all nodes, rabbitmq-diagnostics output, detailed reproduction steps)

RabbitMQ version used

4.0.6

Erlang version used

26.2.x

Operating system (distribution) used

Microsoft Windows

How is RabbitMQ deployed?

Windows installer

Steps to deploy RabbitMQ cluster

Cluster is configured by DSC scripts

Steps to reproduce the behavior in question

Conceptual question.

What problem are you trying to solve?

Hi,

we are using a RabbitMQ cluster with quorum queues, and our C# application connects to the cluster through a load balancer.

As far as we understand, if a majority of the nodes (2 of 3) goes down, the cluster is no longer usable. Our application needs to detect that condition, but the health checks provided by the library we use seem to check only the network connection, not the cluster status. Health is therefore reported as "healthy" although the cluster is logically down and no longer handles messages.

What would you recommend for detecting that kind of failure quickly from application code? Constantly polling one of the following REST endpoints? If so, which one would you recommend? Or do we need to call multiple endpoints?

/api/aliveness-test/vhost | Declares a test queue on the target node, then publishes and consumes a message. Intended to be used as a very basic health check. Responds a 200 OK if the check succeeded, otherwise responds with a 503 Service Unavailable.
/api/health/checks/alarms | Responds a 200 OK if there are no alarms in effect in the cluster, otherwise responds with a 503 Service Unavailable.
/api/health/checks/local-alarms | Responds a 200 OK if there are no local alarms in effect on the target node, otherwise responds with a 503 Service Unavailable.

As an update: we simulated a crash of 2 of 3 RabbitMQ nodes in a cluster. As expected, messages could no longer be sent, but surprisingly all three endpoints mentioned above return "OK". I would have expected at least /alarms to return a warning.

Best
Christoph

Answered by michaelklishin

michaelklishin · 2025-02-18T17:20:40Z

michaelklishin
Feb 18, 2025
Maintainer

@dev4342345235 none of those. GET /api/aliveness-test/{vhost} is a no-op. The other two health checks have nothing to do with whether nodes are up or not, as their descriptions suggest.

When a majority of nodes is down, all quorum queue and stream operations will fail, and so will nearly all client operations in general when Khepri is used. This should be a good enough indication.
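If an application does want to use failing operations as its signal, one way is to put a deadline on each broker operation and treat a timeout or an exception as a failure. The following is a minimal C# sketch under assumptions: `BrokerProbe` and `TryWithinAsync` are hypothetical names, and the delegate passed in stands for whatever publish-and-wait-for-confirmation call the client library in use provides. Whether a given operation fails with an error or simply never completes depends on the operation and the client library, so the sketch covers both cases.

```csharp
using System;
using System.Threading.Tasks;

public static class BrokerProbe
{
    // Hypothetical helper: runs one broker operation (for example, publish a
    // message and await its confirmation) and reports whether it completed
    // successfully before the deadline.
    public static async Task<bool> TryWithinAsync(Func<Task> operation, TimeSpan timeout)
    {
        try
        {
            Task work = operation();
            Task finished = await Task.WhenAny(work, Task.Delay(timeout)).ConfigureAwait(false);
            if (finished != work)
            {
                return false; // the deadline passed first: count it as a failed operation
            }

            await work.ConfigureAwait(false); // rethrows if the operation itself faulted
            return true;
        }
        catch (Exception)
        {
            return false; // connection, channel, or broker error
        }
    }
}
```

A caller would pass its own operation, for example `await BrokerProbe.TryWithinAsync(() => PublishAndConfirmAsync(message), TimeSpan.FromSeconds(5))`, where `PublishAndConfirmAsync` is a placeholder for application code. Note that an operation that times out is not cancelled by this helper; it only stops waiting for it.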

I don't subscribe to the opinion that applications should be monitoring cluster state. Monitoring systems should. Having a majority of nodes down and switching between clusters is not something you will do every day; it's not at all comparable to client connection recovery.

That's a job for the dedicated monitoring systems and alerts. There are already a Prometheus plugin and Grafana dashboards that display node and inter-node link states.

In the HTTP API specifically, GET /api/nodes is the closest endpoint there is, which the management UI itself uses to mark nodes in red when they cannot be reached.

Many HTTP API clients provide access to that endpoint. In any case, the output looks like this:

[
  {
    "name": "rabbit-1@sunnyside",
    "type": "disc",
    "running": true,
    "being_drained": false,
    "cluster_links": [],
    "metrics_gc_queue_length": {
      "connection_closed": 0,
      "channel_closed": 0,
      "consumer_deleted": 0,
      "exchange_deleted": 0,
      "queue_deleted": 0,
      "vhost_deleted": 0,
      "node_node_deleted": 0,
      "channel_consumer_deleted": 0
    }
  },
  {
    "partitions": [],
    "os_pid": "62206",
    "fd_total": 60000,
    "sockets_total": 0,
    "mem_limit": 41231686041,
    "mem_alarm": false,
    "disk_free_limit": 50000000,
    "disk_free_alarm": false,
    "proc_total": 1048576,
    "rates_mode": "basic",
    "uptime": 1450,
    "run_queue": 1,
    "processors": 10,
    "exchange_types": [
      {
        "name": "direct",
        "description": "AMQP direct exchange, as per the AMQP specification",
        "enabled": true
      },
      {
        "name": "headers",
        "description": "AMQP headers exchange, as per the AMQP specification",
        "enabled": true
      },
      {
        "name": "topic",
        "description": "AMQP topic exchange, as per the AMQP specification",
        "enabled": true
      },
      {
        "name": "fanout",
        "description": "AMQP fanout exchange, as per the AMQP specification",
        "enabled": true
      },
      {
        "name": "x-local-random",
        "description": "Picks one random local binding (queue) to route via (to).",
        "enabled": true
      }
    ],
    "auth_mechanisms": [
      {
        "name": "AMQPLAIN",
        "description": "QPid AMQPLAIN mechanism",
        "enabled": true
      },
      {
        "name": "PLAIN",
        "description": "SASL PLAIN authentication mechanism",
        "enabled": true
      },
      {
        "name": "ANONYMOUS",
        "description": "SASL ANONYMOUS authentication mechanism",
        "enabled": true
      },
      {
        "name": "RABBIT-CR-DEMO",
        "description": "RabbitMQ Demo challenge-response authentication mechanism",
        "enabled": false
      }
    ],
    "applications": [
      {
        "name": "amqp10_common",
        "description": "Modules shared by rabbitmq-amqp1.0 and rabbitmq-amqp1.0-client",
        "version": "4.1.0+beta.4.51.g99a09df"
      },
      {
        "name": "amqp_client",
        "description": "RabbitMQ AMQP Client",
        "version": "4.1.0+beta.4.51.g99a09df"
      },
      {
        "name": "asn1",
        "description": "The Erlang ASN1 compiler version 5.3.1",
        "version": "5.3.1"
      },
      {
        "name": "aten",
        "description": "Erlang node failure detector",
        "version": "0.6.0"
      },
      {
        "name": "compiler",
        "description": "ERTS  CXC 138 10",
        "version": "8.5.5"
      },
      {
        "name": "cowboy",
        "description": "Small, fast, modern HTTP server.",
        "version": "2.13.0"
      },
      {
        "name": "cowlib",
        "description": "Support library for manipulating Web protocols.",
        "version": "2.14.0"
      },
      {
        "name": "credentials_obfuscation",
        "description": "Helper library that obfuscates sensitive values in process state",
        "version": "3.4.0"
      },
      {
        "name": "crypto",
        "description": "CRYPTO",
        "version": "5.5.2"
      },
      {
        "name": "cuttlefish",
        "description": "cuttlefish configuration abstraction",
        "version": "3.4.0"
      },
      {
        "name": "enough",
        "description": "A gen_server implementation with additional, overload-protected call type",
        "version": "0.1.0"
      },
      {
        "name": "erts",
        "description": "ERTS  CXC 138 10",
        "version": "15.2.2"
      },
      {
        "name": "gen_batch_server",
        "description": "Generic batching server",
        "version": "0.8.8"
      },
      {
        "name": "horus",
        "description": "Creates standalone modules from anonymous functions",
        "version": "0.3.1"
      },
      {
        "name": "inets",
        "description": "INETS  CXC 138 49",
        "version": "9.3.1"
      },
      {
        "name": "jose",
        "description": "JSON Object Signing and Encryption (JOSE) for Erlang and Elixir.",
        "version": "1.11.10"
      },
      {
        "name": "kernel",
        "description": "ERTS  CXC 138 10",
        "version": "10.2.2"
      },
      {
        "name": "khepri",
        "description": "Tree-like replicated on-disk database library",
        "version": "0.16.0"
      },
      {
        "name": "khepri_mnesia_migration",
        "description": "Tools to migrate between Mnesia and Khepri",
        "version": "0.7.1"
      },
      {
        "name": "mnesia",
        "description": "MNESIA  CXC 138 12",
        "version": "4.23.3"
      },
      {
        "name": "oauth2_client",
        "description": "OAuth2 client from the RabbitMQ Project",
        "version": "4.1.0+beta.4.51.g99a09df"
      },
      {
        "name": "observer_cli",
        "description": "Visualize Erlang Nodes On The Command Line",
        "version": "1.8.2"
      },
      {
        "name": "os_mon",
        "description": "CPO  CXC 138 46",
        "version": "2.10.1"
      },
      {
        "name": "osiris",
        "description": "Foundation of the log-based streaming subsystem for RabbitMQ",
        "version": "1.8.5"
      },
      {
        "name": "public_key",
        "description": "Public key infrastructure",
        "version": "1.17.1"
      },
      {
        "name": "ra",
        "description": "Raft library",
        "version": "2.16.2"
      },
      {
        "name": "rabbit",
        "description": "RabbitMQ",
        "version": "4.1.0+beta.4.53.g819b80b"
      },
      {
        "name": "rabbit_common",
        "description": "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
        "version": "4.1.0+beta.4.51.g99a09df"
      },
      {
        "name": "rabbitmq_prelaunch",
        "description": "RabbitMQ prelaunch setup",
        "version": "4.1.0+beta.4.51.g99a09df"
      },
      {
        "name": "rabbitmq_web_dispatch",
        "description": "RabbitMQ Web Dispatcher",
        "version": "4.1.0+beta.4.51.g99a09df"
      },
      {
        "name": "ranch",
        "description": "Socket acceptor pool for TCP protocols.",
        "version": "2.2.0"
      },
      {
        "name": "recon",
        "description": "Diagnostic tools for production use",
        "version": "2.5.6"
      },
      {
        "name": "redbug",
        "description": "Erlang Tracing Debugger",
        "version": "2.0.7"
      },
      {
        "name": "runtime_tools",
        "description": "RUNTIME_TOOLS",
        "version": "2.1.1"
      },
      {
        "name": "sasl",
        "description": "SASL  CXC 138 11",
        "version": "4.2.2"
      },
      {
        "name": "seshat",
        "description": "Counters registry",
        "version": "0.6.1"
      },
      {
        "name": "ssl",
        "description": "Erlang/OTP SSL application",
        "version": "11.2.7"
      },
      {
        "name": "stdlib",
        "description": "ERTS  CXC 138 10",
        "version": "6.2"
      },
      {
        "name": "stdout_formatter",
        "description": "Tools to format paragraphs, lists and tables as plain text",
        "version": "0.2.4"
      },
      {
        "name": "syntax_tools",
        "description": "Syntax tools",
        "version": "3.2.1"
      },
      {
        "name": "sysmon_handler",
        "description": "Rate-limiting system_monitor event handler",
        "version": "1.3.0"
      },
      {
        "name": "systemd",
        "description": "systemd integration for Erlang applications",
        "version": "0.6.1"
      },
      {
        "name": "thoas",
        "description": "A blazing fast JSON parser and generator in pure Erlang.",
        "version": "1.2.1"
      },
      {
        "name": "tools",
        "description": "DEVTOOLS  CXC 138 16",
        "version": "4.1.1"
      },
      {
        "name": "xmerl",
        "description": "XML parser",
        "version": "2.1"
      }
    ],
    "contexts": [],
    "log_files": [
      "/var/folders/36/3knwd0k150z7nwzwx1c_lft80000gn/T/rabbitmq-test-instances/rabbit-2@sunnyside/log/rabbit-2@sunnyside.log",
      "<stdout>"
    ],
    "db_dir": "/var/folders/36/3knwd0k150z7nwzwx1c_lft80000gn/T/rabbitmq-test-instances/rabbit-2@sunnyside/mnesia/rabbit-2@sunnyside",
    "config_files": [],
    "net_ticktime": 60,
    "enabled_plugins": [
      "amqp_client",
      "cowboy",
      "jose",
      "oauth2_client",
      "rabbitmq_management",
      "rabbitmq_management_agent",
      "rabbitmq_web_dispatch"
    ],
    "mem_calculation_strategy": "rss",
    "ra_open_file_metrics": {
      "ra_log_wal": 0,
      "ra_log_segment_writer": 0
    },
    "name": "rabbit-2@sunnyside",
    "type": "disc",
    "running": true,
    "being_drained": false,
    "mem_used": 107839488,
    "mem_used_details": {
      "rate": 0.0
    },
    "fd_used": 65,
    "fd_used_details": {
      "rate": 0.0
    },
    "sockets_used": 0,
    "sockets_used_details": {
      "rate": 0.0
    },
    "proc_used": 322,
    "proc_used_details": {
      "rate": 0.0
    },
    "disk_free": 50267312128,
    "disk_free_details": {
      "rate": 0.0
    },
    "gc_num": 4354,
    "gc_num_details": {
      "rate": 0.0
    },
    "gc_bytes_reclaimed": 138072768,
    "gc_bytes_reclaimed_details": {
      "rate": 0.0
    },
    "context_switches": 56512,
    "context_switches_details": {
      "rate": 0.0
    },
    "io_read_count": 0,
    "io_read_count_details": {
      "rate": 0.0
    },
    "io_read_bytes": 0,
    "io_read_bytes_details": {
      "rate": 0.0
    },
    "io_read_avg_time": 0.0,
    "io_read_avg_time_details": {
      "rate": 0.0
    },
    "io_write_count": 0,
    "io_write_count_details": {
      "rate": 0.0
    },
    "io_write_bytes": 0,
    "io_write_bytes_details": {
      "rate": 0.0
    },
    "io_write_avg_time": 0.0,
    "io_write_avg_time_details": {
      "rate": 0.0
    },
    "io_sync_count": 0,
    "io_sync_count_details": {
      "rate": 0.0
    },
    "io_sync_avg_time": 0.0,
    "io_sync_avg_time_details": {
      "rate": 0.0
    },
    "io_seek_count": 0,
    "io_seek_count_details": {
      "rate": 0.0
    },
    "io_seek_avg_time": 0.0,
    "io_seek_avg_time_details": {
      "rate": 0.0
    },
    "io_reopen_count": 0,
    "io_reopen_count_details": {
      "rate": 0.0
    },
    "mnesia_ram_tx_count": 0,
    "mnesia_ram_tx_count_details": {
      "rate": 0.0
    },
    "mnesia_disk_tx_count": 0,
    "mnesia_disk_tx_count_details": {
      "rate": 0.0
    },
    "msg_store_read_count": 0,
    "msg_store_read_count_details": {
      "rate": 0.0
    },
    "msg_store_write_count": 0,
    "msg_store_write_count_details": {
      "rate": 0.0
    },
    "queue_index_write_count": 0,
    "queue_index_write_count_details": {
      "rate": 0.0
    },
    "queue_index_read_count": 0,
    "queue_index_read_count_details": {
      "rate": 0.0
    },
    "connection_created": 0,
    "connection_created_details": {
      "rate": 0.0
    },
    "connection_closed": 0,
    "connection_closed_details": {
      "rate": 0.0
    },
    "channel_created": 0,
    "channel_created_details": {
      "rate": 0.0
    },
    "channel_closed": 0,
    "channel_closed_details": {
      "rate": 0.0
    },
    "queue_declared": 0,
    "queue_declared_details": {
      "rate": 0.0
    },
    "queue_created": 0,
    "queue_created_details": {
      "rate": 0.0
    },
    "queue_deleted": 0,
    "queue_deleted_details": {
      "rate": 0.0
    },
    "cluster_links": [
      {
        "stats": {
          "send_bytes": 1654,
          "send_bytes_details": {
            "rate": 0.0
          },
          "recv_bytes": 1822,
          "recv_bytes_details": {
            "rate": 0.0
          }
        },
        "name": "rabbitmqcli-75-rabbit-2@sunnyside",
        "peer_addr": "127.0.0.1",
        "peer_port": 63426,
        "sock_addr": "127.0.0.1",
        "sock_port": 25673,
        "recv_bytes": 1822,
        "send_bytes": 1654
      }
    ],
    "metrics_gc_queue_length": {
      "connection_closed": 0,
      "channel_closed": 0,
      "consumer_deleted": 0,
      "exchange_deleted": 0,
      "queue_deleted": 0,
      "vhost_deleted": 0,
      "node_node_deleted": 0,
      "channel_consumer_deleted": 0
    }
  },
  {
    "name": "rabbit-3@sunnyside",
    "type": "disc",
    "running": true,
    "being_drained": false,
    "cluster_links": [],
    "metrics_gc_queue_length": {
      "connection_closed": 0,
      "channel_closed": 0,
      "consumer_deleted": 0,
      "exchange_deleted": 0,
      "queue_deleted": 0,
      "vhost_deleted": 0,
      "node_node_deleted": 0,
      "channel_consumer_deleted": 0
    }
  }
]

4 replies

dev4342345235 Feb 18, 2025
Author

Thanks for your response. For various reasons we need to monitor cluster health in order to switch over to a failover cluster, so we are looking for reliable ways to detect a cluster failure.

I will take a look at the GET /api/nodes endpoint you suggested and maybe use it for internal decision logic that checks whether a majority is down.

Another REST endpoint that seemed promising at first was api/health/checks/node-is-quorum-critical

but unfortunately that endpoint only returns a warning if 1 of 3 nodes is down, and the message is no different if 2 of 3 nodes are down.

michaelklishin Feb 18, 2025
Maintainer

@dev4342345235 that endpoint and its respective CLI command exist for the needs of rolling restart automation.

dev4342345235 Feb 19, 2025
Author

We wrote a test client that polls api/nodes, and it seems to detect failed nodes reliably. Could you please confirm that this code would be usable for a production-grade check of cluster health? (Of course we would add some retry logic and error handling.)

I just want to be sure that we understood your hint about the api/nodes endpoint correctly.

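using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Text.Json;
using System.Text.Json.Serialization;

string host = "localhost";
// 15672 is the default plain-HTTP management port; adjust the scheme and port
// below to match how the management listener is actually configured.
int managementPort = 15672;

var credentials = new NetworkCredential() { UserName = "x", Password = "x" };

using (var handler = new HttpClientHandler { Credentials = credentials })
using (var client = new HttpClient(handler))
{
    var url = $"https://{host}:{managementPort}/api/nodes";

    var response = await client.GetAsync(url).ConfigureAwait(false);
    // Stop on 401, 404, 5xx and similar instead of trying to parse an error body.
    response.EnsureSuccessStatusCode();

    var jsonResponse = await response.Content.ReadAsStringAsync().ConfigureAwait(false);

    var nodeInfos = JsonSerializer.Deserialize<List<NodeInfo>>(jsonResponse);

    if (nodeInfos is not null)
    {
        foreach (var nodeInfo in nodeInfos)
        {
            Console.WriteLine($"Node {nodeInfo.Name}: {nodeInfo.Running}");
        }

        int total = nodeInfos.Count;
        int healthyNodes = nodeInfos.Count(n => n.Running);

        // Strict majority: 2 of 3 passes, 1 of 3 does not.
        if (healthyNodes > (total / 2.0))
        {
            Console.WriteLine("Cluster is healthy");
        }
        else
        {
            Console.WriteLine("Cluster is not healthy");
        }
    }
}

// NodeInfo was not shown in the original post; this is a minimal assumed definition.
// The JSON keys are lowercase, so they are mapped explicitly.
public class NodeInfo
{
    [JsonPropertyName("name")]
    public string Name { get; set; } = "";

    [JsonPropertyName("running")]
    public bool Running { get; set; }
}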
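For the retry logic mentioned above, one possible approach is to act only after several consecutive polls report a lost majority, so that a single failed request or a brief node restart does not trigger a switch. This is a sketch under assumptions: the class name `MajorityTracker`, its methods, and the threshold are illustrative and not part of RabbitMQ or any client library, and counting an unreachable management API as a failed observation is a design choice rather than a recommendation from this thread.

```csharp
using System;

public sealed class MajorityTracker
{
    private readonly int _threshold;
    private int _consecutiveFailures;

    public MajorityTracker(int threshold)
    {
        if (threshold < 1) throw new ArgumentOutOfRangeException(nameof(threshold));
        _threshold = threshold;
    }

    // Strict majority in integer arithmetic: 2 of 3 passes, 1 of 3 and 1 of 2 do not.
    public static bool HasMajority(int runningNodes, int totalNodes) =>
        totalNodes > 0 && 2 * runningNodes > totalNodes;

    // Records one poll result. Pass null for runningNodes when the poll itself
    // failed (for example, the HTTP request timed out); that also counts as a
    // failed observation. Returns true once the number of consecutive failed
    // observations reaches the threshold.
    public bool Observe(int? runningNodes, int totalNodes)
    {
        bool ok = runningNodes.HasValue && HasMajority(runningNodes.Value, totalNodes);
        _consecutiveFailures = ok ? 0 : _consecutiveFailures + 1;
        return _consecutiveFailures >= _threshold;
    }
}
```

With `new MajorityTracker(3)`, three polls in a row must see fewer than a majority of nodes running before `Observe` returns true; any healthy poll in between resets the count.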

michaelklishin Feb 19, 2025
Maintainer

@dev4342345235 this is outright insulting. It's your job to decide how to monitor your cluster. We are not your "free devops on the Internet".

I have explained what monitoring options are available and what is the right thing to do in my opinion, having seen thousands of RabbitMQ installations in my 15 years as a contributor.

Now do the job you are likely paid to do.
