Detecting cluster failover by application #13359
Community Support Policy
RabbitMQ version used: 4.0.6
Erlang version used: 26.2.x
Operating system (distribution) used: Microsoft Windows
How is RabbitMQ deployed? Windows installer
Steps to deploy RabbitMQ cluster: Cluster is configured by DSC scripts
Steps to reproduce the behavior in question: Conceptual question.
What problem are you trying to solve?
Hi, we are using a RabbitMQ cluster with quorum queues, and our C# application connects to the cluster through a load balancer. As far as we understand, if a majority of the nodes (2 of 3) goes down, the cluster is no longer usable. Our application needs to detect that condition, but the health checks provided by the library we use seem to check only the network connection, not the cluster status. As a result, health is reported as "healthy" even though the cluster is logically down and no longer handles messages. What would you recommend for detecting this kind of failure quickly from application code? Constantly polling one of the following REST endpoints? If yes, which one would you recommend? Or do we need to call multiple endpoints?
As an update: we have simulated a crash of 2 of 3 RabbitMQ nodes in a cluster. As expected, messages could no longer be sent, but surprisingly all three endpoints mentioned above returned "OK". I would have expected at least /alarms to return a warning. Best
Replies: 1 comment 4 replies
@dev4342345235 none of those. GET /api/aliveness-test/{vhost} is a no-op. The other two health checks have nothing to do with whether nodes are up or not, as their descriptions suggest.
When a majority of nodes is down, all quorum queue and stream operations will fail, and so will nearly all client operations in general when Khepri is used. This should be a good enough indication.
I don't subscribe to the opinion that applications should be monitoring cluster state. Monitoring systems should. Having a majority of nodes down and switching between clusters is not something you will do every day; it's not at all comparable to client connection recovery.
That's a job for dedicated monitoring systems and alerts. There are already a Prometheus plugin and Grafana dashboards that display node and inter-node link states.
In the HTTP API specifically, […] Many HTTP API clients provide access to that endpoint. In any case, the output looks like this: […]
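One way to read the suggestion above, that failing quorum queue operations are themselves a good enough indication, is for the application to track the outcome of its own publishes instead of polling the HTTP API. The Python sketch below is only an illustration under stated assumptions and is not taken from any RabbitMQ client library: `PublishHealth`, `publish_with_tracking`, and the threshold of three are hypothetical names and values, and `publish` stands for whatever call in your client raises when a publish is rejected or its confirmation times out. The original question concerns a C# application; Python is used here only for brevity.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PublishHealth:
    """Reports unhealthy after several consecutive failed publishes.

    Hypothetical helper: names and thresholds are illustrative
    assumptions, not part of any RabbitMQ client library.
    """

    failure_threshold: int = 3  # consecutive failures before "unhealthy"
    clock: Callable[[], float] = time.monotonic
    consecutive_failures: int = 0
    last_success_at: Optional[float] = None

    def record_success(self) -> None:
        # Any successful publish resets the failure streak.
        self.consecutive_failures = 0
        self.last_success_at = self.clock()

    def record_failure(self) -> None:
        self.consecutive_failures += 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold


def publish_with_tracking(publish: Callable[[], None], health: PublishHealth) -> bool:
    """Run one publish attempt and feed its outcome into the tracker.

    `publish` is assumed to raise when the broker rejects the message
    or a publisher confirm does not arrive in time.
    """
    try:
        publish()
    except Exception:
        health.record_failure()
        return False
    health.record_success()
    return True
```

An application health check could then report `health.healthy`. This only reflects what the application itself observes; watching node and cluster state remains a task for the monitoring system, as the reply argues.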