Running RabbitMQ on Kubernetes & OpenShift: A Deployment Playbook

Running RabbitMQ on Kubernetes or OpenShift is increasingly common, and increasingly a source of operational surprises for teams who assume Kubernetes will just handle the complexity. It won't, at least not unless you understand how RabbitMQ's quorum mechanisms interact with Kubernetes pod lifecycle management.

This guide covers the key deployment patterns, the specific gotchas that cause production incidents, and how to structure your Kubernetes RabbitMQ deployment for reliable operation.

How do you deploy RabbitMQ on Kubernetes?

The standard approach is the RabbitMQ Cluster Operator, the official Kubernetes operator maintained by Broadcom. The operator handles:

Deploying RabbitMQ as a StatefulSet with proper pod naming and stable network identities
Managing PersistentVolumeClaims (PVCs) for each cluster node's data storage
Handling cluster membership, node discovery, and the Erlang cookie
Providing a RabbitmqCluster custom resource for declarative cluster configuration

For OpenShift, the same operator works with some additional SCC (Security Context Constraint) configuration. AceMQ also supports deployments on RKE2, AKS, ACK, and other managed Kubernetes variants.

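Minimum cluster configuration: three RabbitMQ nodes, each on a separate Kubernetes worker node (enforced through pod anti-affinity rules). Running two RabbitMQ pods on the same worker node defeats high availability, because losing that one worker takes two of the three members offline and the cluster loses its majority.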
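
The anti-affinity requirement can be declared on the RabbitmqCluster resource itself. The following is a minimal sketch, assuming the operator labels its pods with app.kubernetes.io/name set to the cluster name and passes spec.affinity through to the pod template. The helper name render_rabbitmq_cluster and the sample names my-rabbit and messaging are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a RabbitmqCluster manifest with three replicas
# and a hard anti-affinity rule (at most one RabbitMQ pod per worker node).
# Assumption: the operator's pods carry the label
# app.kubernetes.io/name=<cluster-name>.
render_rabbitmq_cluster() {
  local cluster="$1" ns="$2"
  cat <<EOF
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: ${cluster}
  namespace: ${ns}
spec:
  replicas: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - ${cluster}
          topologyKey: kubernetes.io/hostname
EOF
}

# Usage (requires cluster access):
#   render_rabbitmq_cluster my-rabbit messaging | kubectl apply -f -
```

Using the "required" form rather than the "preferred" form means a pod stays Pending when no separate worker node is available, which is usually the safer failure mode for a three-node cluster.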

What is the termination grace period and why does it matter?

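The RabbitMQ Cluster Operator configures each pod with a pre-stop hook and a default termination grace period of 604,800 seconds (one week). The long default is not a mistake. It is a safety mechanism that gives the hook time to wait for the cluster to become healthy before the pod is killed.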

Before a RabbitMQ pod shuts down, the pre-stop hook checks whether any quorum queues with a replica on that pod are in a quorum-critical state, meaning the pod's shutdown would bring one or more queues below quorum. If that condition is true, the hook blocks termination until quorum is restored or the grace period runs out.

"What the pre-stop hook does is it checks to make sure that none of the queues are in a quorum-critical status before it allows the pod to exit. It's protecting quorum. But when you go from three to zero replicas, it creates a deadlock — because taking one pod down makes quorum critical, which prevents that pod from terminating, which prevents the cluster from coming down."
— Scott Sternloff, AceMQ Principal Architect, Adeptia engagement session, April 2026
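
The arithmetic behind this deadlock can be sketched in a few lines. This is a simplified model, not the operator's actual hook logic: it assumes every queue has one replica per pod and treats "quorum" as a strict majority of members. The function names majority and would_lose_quorum are hypothetical.

```shell
#!/usr/bin/env bash
# Simplified model of the quorum check (NOT the operator's real hook).

# majority <members>: smallest number of members that forms a quorum.
majority() {
  echo $(( $1 / 2 + 1 ))
}

# would_lose_quorum <members> <online>
# Exits 0 if stopping one more online member would leave fewer members
# online than the majority requires.
would_lose_quorum() {
  local members="$1" online="$2"
  [ $(( online - 1 )) -lt "$(majority "$members")" ]
}

# Walk a three-member cluster down one pod at a time.
for online in 3 2 1; do
  if would_lose_quorum 3 "$online"; then
    echo "online=$online: stopping a pod breaks quorum, shutdown is blocked"
  else
    echo "online=$online: safe to stop one pod"
  fi
done
```

In this model the first pod of three can stop, but once it is gone the next shutdown is blocked, which is the point at which a scale-to-zero stalls.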

How do you safely scale a RabbitMQ cluster to zero on Kubernetes?

The safe procedure for scaling down to zero requires two steps executed in sequence, not simultaneously:

Step 1: Reduce the termination grace period to 30 seconds:

kubectl patch rabbitmqcluster <cluster-name> -n <namespace> \
  --type merge \
  -p '{"spec": {"terminationGracePeriodSeconds": 30}}'

Step 2: After that change has reached the running pods (a pod only picks up a new grace period when it is recreated), scale replicas to zero:

kubectl patch rabbitmqcluster <cluster-name> -n <namespace> \
  --type merge \
  -p '{"spec": {"replicas": 0}}'

When you bring the cluster back up, restore the grace period to its default:

kubectl patch rabbitmqcluster <cluster-name> -n <namespace> \
  --type merge \
  -p '{"spec": {"replicas": 3, "terminationGracePeriodSeconds": 604800}}'
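
The scale-down sequence above can be wrapped in a small script so that the second patch only runs once the first has rolled out. This is a sketch under stated assumptions: it assumes the operator names the StatefulSet <cluster-name>-server and propagates the grace period into its pod template. The function name scale_rabbitmq_to_zero, the retry count, and the timeout are illustrative choices.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scale_rabbitmq_to_zero <cluster-name> <namespace>
scale_rabbitmq_to_zero() {
  local cluster="$1" ns="$2"
  # Assumption: the operator names the StatefulSet "<cluster-name>-server".
  local sts="${cluster}-server"
  local tries=0 grace=""

  # Step 1: shorten the termination grace period.
  kubectl patch rabbitmqcluster "$cluster" -n "$ns" --type merge \
    -p '{"spec": {"terminationGracePeriodSeconds": 30}}' || return 1

  # Wait until the StatefulSet pod template shows the new value.
  while [ "$tries" -lt 30 ]; do
    grace="$(kubectl get statefulset "$sts" -n "$ns" \
      -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}')"
    [ "$grace" = "30" ] && break
    tries=$(( tries + 1 ))
    sleep 2
  done
  if [ "$grace" != "30" ]; then
    echo "grace period was not applied to $sts" >&2
    return 1
  fi

  # Wait for the pods to be recreated with the new grace period.
  kubectl rollout status "statefulset/$sts" -n "$ns" --timeout=15m || return 1

  # Step 2: scale to zero.
  kubectl patch rabbitmqcluster "$cluster" -n "$ns" --type merge \
    -p '{"spec": {"replicas": 0}}'
}

# Usage (requires cluster access):
#   scale_rabbitmq_to_zero my-rabbit messaging
```

Waiting on the rollout matters because pods that were started with the one-week grace period keep it until they are replaced.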

What happens when a Kubernetes node gets drained during a cluster upgrade?

When a Kubernetes cluster upgrade drains a node, the pods on that node are evicted. If a RabbitMQ pod is running on the drained node and its queues are quorum-critical, the pod gets stuck in the Terminating state. That blocks the node drain, which blocks the upgrade, and someone has to intervene manually.

"Your clients have an automated process to upgrade their cluster. At a time, one node can go down. And because the RabbitMQ pod running on that node refuses to terminate due to quorum protection, it won't allow the node to be drained. Their upgrade process gets stuck."
— Scott Sternloff, AceMQ Principal Architect, Adeptia session, April 2026

Solutions:

Reduce the default termination grace period before automated upgrade windows, and restore it afterward
Set a watchdog timer: if a pod hasn't terminated within N seconds of receiving a SIGTERM, force-delete it (use cautiously)
Use pod disruption budgets (PDBs): a PDB that allows at most one unavailable pod stops a drain from evicting a second RabbitMQ pod while the first is still down, which keeps a majority of a three-node cluster online during node drains
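
A pod disruption budget for the third option might look like the sketch below. It assumes the RabbitMQ pods carry the label app.kubernetes.io/name=<cluster-name>. The helper name render_rabbitmq_pdb and the "-pdb" suffix are hypothetical, and you should first check whether your operator version already creates a PDB for the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a PodDisruptionBudget that lets voluntary
# disruptions (such as node drains) take down at most one RabbitMQ pod.
# Assumption: pods are labelled app.kubernetes.io/name=<cluster-name>.
render_rabbitmq_pdb() {
  local cluster="$1" ns="$2"
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${cluster}-pdb
  namespace: ${ns}
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ${cluster}
EOF
}

# Usage (requires cluster access):
#   render_rabbitmq_pdb my-rabbit messaging | kubectl apply -f -
```

A PDB only governs voluntary evictions. It does not protect against node failures or force-deletes.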

What should you never do with RabbitMQ pods on Kubernetes?

Never force-delete a RabbitMQ pod without understanding the quorum impact. Force-deleting (kubectl delete pod --force --grace-period=0) bypasses the pre-stop hook entirely. If that pod was one of the last members keeping a queue at quorum, the queue loses quorum immediately.

Never run two RabbitMQ pods on the same Kubernetes node. Anti-affinity rules should enforce this automatically, but verify your anti-affinity configuration is correctly set.

Never scale from 3 replicas to a lower number without first verifying quorum health. Run both checks against each pod:

kubectl exec -n <namespace> <rabbitmq-pod> -- rabbitmq-diagnostics check_running
kubectl exec -n <namespace> <rabbitmq-pod> -- rabbitmq-queues check_if_node_is_quorum_critical

If any queues report as quorum-critical, wait for them to recover before proceeding with the scale-down.
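
To run both checks against every pod before a scale-down, a loop along these lines could be used. It is a sketch that assumes both commands exit non-zero when the check fails. The function name quorum_safe and the pod names in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: quorum_safe <namespace> <pod> [<pod>...]
# Runs both checks against every listed pod and stops at the first failure.
# Assumption: each command exits non-zero when its check fails.
quorum_safe() {
  local ns="$1" pod
  shift
  for pod in "$@"; do
    if ! kubectl exec -n "$ns" "$pod" -- rabbitmq-diagnostics check_running; then
      echo "$pod: RabbitMQ is not running" >&2
      return 1
    fi
    if ! kubectl exec -n "$ns" "$pod" -- \
        rabbitmq-queues check_if_node_is_quorum_critical; then
      echo "$pod: quorum-critical queues reported, do not scale down" >&2
      return 1
    fi
  done
  echo "all checked pods passed"
}

# Usage (requires cluster access):
#   quorum_safe messaging my-rabbit-server-0 my-rabbit-server-1 my-rabbit-server-2
```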

Storage and monitoring considerations

Each RabbitMQ pod requires a PersistentVolumeClaim. For production:

Use a StorageClass with volumeBindingMode: WaitForFirstConsumer to ensure pods and their volumes land in the same availability zone
SSD or NVMe storage is strongly recommended because quorum queue WAL writes are latency-sensitive
Set the reclaim policy to Retain (on the StorageClass or the PersistentVolume) so that data survives deletion of the PVC
Do not use shared storage (NFS, shared file systems) for RabbitMQ data volumes
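
The binding mode and retention settings from this list live on the StorageClass. A minimal sketch follows. The provisioner is cluster-specific and is passed in as a parameter, so no real value is assumed here. The helper name render_rabbitmq_storageclass and the class name rabbitmq-ssd are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a StorageClass with delayed volume binding and
# a Retain reclaim policy. The provisioner depends on your platform and must
# be supplied by the caller.
render_rabbitmq_storageclass() {
  local name="$1" provisioner="$2"
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ${name}
provisioner: ${provisioner}
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF
}

# Usage (requires cluster access; replace the placeholder provisioner):
#   render_rabbitmq_storageclass rabbitmq-ssd <your-csi-provisioner> | kubectl apply -f -
```

The RabbitmqCluster would then refer to the class by name, typically through the storage class setting in its persistence section.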

Key metrics to track in Kubernetes-specific deployments:

rabbitmq_quorum_queue_stat_voters — verify quorum is maintained across nodes
Pod restart counts — frequent restarts indicate quorum or resource issues
PVC capacity — monitor disk utilization on each pod's PVC
rabbitmq_node_disk_free — alert before the disk alarm triggers
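
Pod restart counts can be checked without a metrics stack. The sketch below lists pods whose first container has restarted more than a given number of times. It assumes the pods are labelled app.kubernetes.io/name=<cluster-name> and that the RabbitMQ container is the first entry in containerStatuses. The function name pods_over_restart_limit and the threshold are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pods_over_restart_limit <namespace> <cluster-name> <limit>
# Prints "<pod> <restarts>" for every pod whose first container has restarted
# more than <limit> times.
# Assumptions: pods are labelled app.kubernetes.io/name=<cluster-name> and
# the RabbitMQ container is containerStatuses[0].
pods_over_restart_limit() {
  local ns="$1" cluster="$2" limit="$3"
  kubectl get pods -n "$ns" -l "app.kubernetes.io/name=${cluster}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' |
  while read -r pod restarts; do
    [ -n "$pod" ] || continue
    if [ "$restarts" -gt "$limit" ]; then
      echo "$pod $restarts"
    fi
  done
}

# Usage (requires cluster access):
#   pods_over_restart_limit messaging my-rabbit 3
```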

If you're planning a new deployment, migrating an existing cluster to Kubernetes, or dealing with upgrade-related issues, contact AceMQ for deployment architecture or hands-on support.
