
Running RabbitMQ on Kubernetes & OpenShift: A Deployment Playbook


AceMQ Engineering Team

RabbitMQ Consulting & Support

Running RabbitMQ on Kubernetes or OpenShift is increasingly common, and increasingly the source of operational surprises for teams who assume Kubernetes will just handle the complexity. It doesn't — at least not without understanding how RabbitMQ's quorum mechanisms interact with Kubernetes pod lifecycle management.
This guide covers the key deployment patterns, the specific gotchas that cause production incidents, and how to structure your Kubernetes RabbitMQ deployment for reliable operation.

How do you deploy RabbitMQ on Kubernetes?

The standard approach is the RabbitMQ Cluster Operator, the official Kubernetes operator maintained by Broadcom. The operator handles:
  • Deploying RabbitMQ as a StatefulSet with proper pod naming and stable network identities
  • Managing PersistentVolumeClaims (PVCs) for each cluster node's data storage
  • Handling cluster membership, node discovery, and the Erlang cookie
  • Providing a RabbitmqCluster custom resource for declarative cluster configuration
For OpenShift, the same operator works with some additional SCC (Security Context Constraint) configuration. AceMQ also supports deployments on RKE2, AKS, ACK, and other managed Kubernetes variants.
Minimum cluster configuration: Three nodes, each on a separate Kubernetes worker node (enforced through pod anti-affinity rules). Running two RabbitMQ pods on the same worker node defeats high availability.
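As a rough sketch of that minimum configuration, the helper below prints a RabbitmqCluster manifest with three replicas and a required pod anti-affinity rule keyed on the worker node hostname. The function name render_cluster is hypothetical, and the selector assumes the operator labels its pods with app.kubernetes.io/name set to the cluster name; check the labels on your own pods before relying on it.

```shell
# Print a RabbitmqCluster manifest with three replicas and a hard
# anti-affinity rule (at most one RabbitMQ pod per worker node).
# Usage: render_cluster <cluster-name> <namespace>
render_cluster() {
  cat <<EOF
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: $1
  namespace: $2
spec:
  replicas: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              # Assumed pod label; verify with: kubectl get pods --show-labels
              app.kubernetes.io/name: $1
          topologyKey: kubernetes.io/hostname
EOF
}
```

The output can be reviewed first and then piped to kubectl apply -f - (for example, render_cluster <cluster-name> <namespace> | kubectl apply -f -). Because the rule is required rather than preferred, a pod stays Pending if no separate worker node is available, which is usually preferable to silently co-locating two replicas.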

What is the termination grace period and why does it matter?

The RabbitMQ Cluster Operator configures a pre-stop hook on each pod and sets a default termination grace period of 604,800 seconds (one week), so the hook has time to wait. This is not a mistake; it's a safety mechanism.
Before a RabbitMQ pod shuts down, the pre-stop hook checks whether any quorum queue with a replica on that pod is in a quorum-critical state, meaning the pod's shutdown would leave the queue without a majority of its replicas online. While that condition holds, the hook blocks and the pod does not terminate until quorum is restored or the grace period runs out.

"What the pre-stop hook does is it checks to make sure that none of the queues are in a quorum-critical status before it allows the pod to exit. It's protecting quorum. But when you go from three to zero replicas, it creates a deadlock — because taking one pod down makes quorum critical, which prevents that pod from terminating, which prevents the cluster from coming down."

Scott Sternloff, AceMQ Principal Architect, Adeptia engagement session, April 2026
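To see what is actually configured on a running cluster, the sketch below reads the grace period and the pre-stop command from the StatefulSet's pod template. The function name show_shutdown_settings is hypothetical, and two details are assumptions: that the StatefulSet is named <cluster-name>-server and that the RabbitMQ container is the first container in the pod.

```shell
# Print the termination grace period and the pre-stop command that the
# operator configured on the cluster's StatefulSet.
# Usage: show_shutdown_settings <cluster-name> <namespace>
show_shutdown_settings() (
  sts="$1-server"   # assumed StatefulSet name: <cluster-name>-server
  kubectl get statefulset "$sts" -n "$2" \
    -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}'
  # containers[0] is assumed to be the RabbitMQ container
  kubectl get statefulset "$sts" -n "$2" \
    -o jsonpath='{.spec.template.spec.containers[0].lifecycle.preStop.exec.command}{"\n"}'
)
```

Reading these values before a maintenance window is a cheap way to confirm whether the one-week default is still in place.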

How do you safely scale a RabbitMQ cluster to zero on Kubernetes?

The safe procedure for scaling down to zero requires two steps executed in sequence, not simultaneously:
Step 1: Reduce the termination grace period to 30 seconds:
kubectl patch rabbitmqcluster <cluster-name> -n <namespace> \
  --type merge \
  -p '{"spec": {"terminationGracePeriodSeconds": 30}}'
Step 2: After that change takes effect (running pods keep their old grace period until the StatefulSet's rolling restart recreates them), scale replicas to zero:
kubectl patch rabbitmqcluster <cluster-name> -n <namespace> \
  --type merge \
  -p '{"spec": {"replicas": 0}}'
When you bring the cluster back up, restore the grace period to its default:
kubectl patch rabbitmqcluster <cluster-name> -n <namespace> \
  --type merge \
  -p '{"spec": {"replicas": 3, "terminationGracePeriodSeconds": 604800}}'
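The two steps can be wrapped in a small script that waits between them. This is a sketch under stated assumptions, not a definitive procedure: the function name scale_to_zero, the <cluster-name>-server StatefulSet name, and the timeouts are all illustrative.

```shell
# Two-step scale-down: shorten the grace period, wait for the rolling
# restart to finish, then set replicas to zero.
# Usage: scale_to_zero <cluster-name> <namespace>
scale_to_zero() (
  cluster="$1"; ns="$2"
  sts="${cluster}-server"   # assumed StatefulSet name

  kubectl patch rabbitmqcluster "$cluster" -n "$ns" --type merge \
    -p '{"spec": {"terminationGracePeriodSeconds": 30}}' || exit 1

  # Wait (about 60s at most) for the operator to update the pod template
  i=0
  until [ "$(kubectl get statefulset "$sts" -n "$ns" \
      -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}')" = "30" ]; do
    i=$((i + 1))
    [ "$i" -lt 30 ] || { echo "grace period was not applied" >&2; exit 1; }
    sleep 2
  done

  # Pods only pick up the new value once they have been recreated
  kubectl rollout status "statefulset/$sts" -n "$ns" --timeout=15m || exit 1

  kubectl patch rabbitmqcluster "$cluster" -n "$ns" --type merge \
    -p '{"spec": {"replicas": 0}}'
)
```

Stopping on the first failed command matters here: if the rolling restart does not complete, the replicas patch is never sent, so the cluster is not pushed into the deadlock described earlier.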

What happens when a Kubernetes node gets drained during a cluster upgrade?

When a Kubernetes cluster upgrade drains a node, the pods on that node are evicted and terminated. If a RabbitMQ pod is running on the drained node and its queues are quorum-critical, the pod gets stuck in the Terminating state, which blocks the node drain, blocks the upgrade, and requires manual intervention.

"Your clients have an automated process to upgrade their cluster. At a time, one node can go down. And because the RabbitMQ pod running on that node refuses to terminate due to quorum protection, it won't allow the node to be drained. Their upgrade process gets stuck."

Scott Sternloff, AceMQ Principal Architect, Adeptia session, April 2026

Solutions:
  1. Reduce the default termination grace period before automated upgrade windows, and restore it afterward
  2. Set a watchdog timer: if a pod hasn't terminated within N seconds of receiving a SIGTERM, force-delete it (use cautiously)
  3. Use pod disruption budgets (PDBs): a PDB that allows at most one unavailable pod at a time stops a drain from evicting a second RabbitMQ pod while the first is still down, which keeps a three-node cluster at a majority
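For the third option, a minimal sketch of such a PDB is shown below. The function name render_pdb and the -pdb name suffix are hypothetical, and the selector again assumes the pods carry the app.kubernetes.io/name label set to the cluster name. Check whether a PDB already selects these pods before adding another, since overlapping budgets can interfere with evictions.

```shell
# Print a PodDisruptionBudget that allows at most one RabbitMQ pod
# to be voluntarily disrupted (for example by a node drain) at a time.
# Usage: render_pdb <cluster-name> <namespace>
render_pdb() {
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: $1-pdb
  namespace: $2
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      # Assumed pod label; verify with: kubectl get pods --show-labels
      app.kubernetes.io/name: $1
EOF
}
```

A PDB only limits how many pods the eviction API will take down at once. It does not release a pod that is already blocked in its pre-stop hook, so it complements the grace period change rather than replacing it.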

What should you never do with RabbitMQ pods on Kubernetes?

Never force-delete a RabbitMQ pod without understanding the quorum impact. Force-deleting (kubectl delete pod --force --grace-period=0) bypasses the pre-stop hook entirely. Any quorum queue whose remaining replicas no longer form a majority loses quorum immediately, and queues whose leader was on that pod must elect a new one before they can make progress.
Never run two RabbitMQ pods on the same Kubernetes node. Anti-affinity rules should enforce this automatically, but verify your anti-affinity configuration is correctly set.
Never scale from 3 replicas to a lower number without first verifying quorum health.
kubectl exec -n <namespace> <rabbitmq-pod> -- rabbitmq-diagnostics check_running
kubectl exec -n <namespace> <rabbitmq-pod> -- rabbitmq-queues check_if_node_is_quorum_critical
If any queues report as quorum-critical, wait for them to recover before proceeding with the scale-down.
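The two checks above can be run against every pod in one pass. The sketch below assumes the pods are labeled app.kubernetes.io/name=<cluster-name> and relies on both commands exiting non-zero when the check fails; the function name quorum_preflight is hypothetical.

```shell
# Run both health checks on every pod of the cluster and stop at the
# first pod that is not running or is quorum-critical.
# Usage: quorum_preflight <cluster-name> <namespace>
quorum_preflight() (
  pods=$(kubectl get pods -n "$2" -l "app.kubernetes.io/name=$1" \
    -o jsonpath='{.items[*].metadata.name}') || exit 1
  [ -n "$pods" ] || { echo "no pods found for $1" >&2; exit 1; }

  for pod in $pods; do
    kubectl exec -n "$2" "$pod" -- rabbitmq-diagnostics check_running \
      || { echo "$pod: node is not running" >&2; exit 1; }
    kubectl exec -n "$2" "$pod" -- rabbitmq-queues check_if_node_is_quorum_critical \
      || { echo "$pod: quorum-critical queues present" >&2; exit 1; }
  done
  echo "no quorum-critical queues reported"
)
```

Because the function returns a non-zero status on the first failure, it can gate a scale-down in a pipeline, for example quorum_preflight <cluster-name> <namespace> && followed by the scale-down command.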

Storage and monitoring considerations

Each RabbitMQ pod requires a PersistentVolumeClaim. For production:
  • Use a StorageClass with volumeBindingMode: WaitForFirstConsumer so that each volume is provisioned in the availability zone where its pod is scheduled
  • SSD or NVMe storage is strongly recommended — quorum queue WAL writes are latency-sensitive
  • Set reclaimPolicy: Retain on the StorageClass (or on the bound PersistentVolumes) so that data survives deletion of the PVC, not just of the pod
  • Do not use shared storage (NFS, shared file systems) for RabbitMQ data volumes
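A StorageClass along these lines might look like the sketch below. The function name render_storage_class is hypothetical, and the provisioner is deliberately left as an argument because it depends entirely on your platform's CSI driver.

```shell
# Print a StorageClass that delays volume binding until the pod is
# scheduled and keeps the underlying volume when its claim is deleted.
# Usage: render_storage_class <name> <provisioner>
render_storage_class() {
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: $1
# The provisioner is platform-specific; pass your CSI driver's name
provisioner: $2
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF
}
```

The class is then referenced from the RabbitmqCluster resource through spec.persistence.storageClassName. With Retain, released volumes are not cleaned up automatically, so decommissioning a cluster includes deleting its PersistentVolumes by hand.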
Key metrics to track in Kubernetes-specific deployments:
  • rabbitmq_quorum_queue_stat_voters — verify quorum is maintained across nodes
  • Pod restart counts — frequent restarts indicate quorum or resource issues
  • PVC capacity — monitor disk utilization on each pod's PVC
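  • rabbitmq_node_disk_free (the built-in rabbitmq_prometheus plugin exposes free disk space as rabbitmq_disk_space_available_bytes) — alert before the disk alarm triggers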
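Until dashboards and alerts are in place, restart counts and disk usage can be spot-checked from the command line. The sketch below is illustrative only: the function name pod_health is hypothetical, the pod label is assumed as before, and /var/lib/rabbitmq/mnesia is assumed to be where the data volume is mounted in the container.

```shell
# Show restart counts for every pod of the cluster, then disk usage of
# the filesystem that holds each pod's RabbitMQ data directory.
# Usage: pod_health <cluster-name> <namespace>
pod_health() (
  kubectl get pods -n "$2" -l "app.kubernetes.io/name=$1" \
    -o custom-columns='POD:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount' \
    || exit 1

  for pod in $(kubectl get pods -n "$2" -l "app.kubernetes.io/name=$1" \
      -o jsonpath='{.items[*].metadata.name}'); do
    echo "== $pod"
    # Assumed data path inside the container
    kubectl exec -n "$2" "$pod" -- df -h /var/lib/rabbitmq/mnesia
  done
)
```

This is a manual check, not a substitute for alerting: the point of the metrics above is to be warned before the disk alarm blocks publishers.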
If you're planning a new deployment, migrating an existing cluster to Kubernetes, or dealing with upgrade-related issues, contact AceMQ for deployment architecture or hands-on support.
