Telecommunications | Remediation | Cloud / Kubernetes

Scaling RabbitMQ to support hundreds of thousands of connected devices for a global telecom leader

Global Telecom Leader

Overview

The company's IoT platform relies on RabbitMQ to handle messaging for more than 300,000 connected devices, with significant growth planned. Weekly out-of-memory crashes and performance bottlenecks threatened both platform stability and those growth plans.

Challenge

RabbitMQ nodes were crashing weekly with out-of-memory errors. The causes were an improper vertical scaling approach, a consumer prefetch count of 1 that throttled consumers and let queues back up, and missing TTL configuration on queues. Slow consumers caused producers to be blocked, which created cascading failures. The team was also using classic queues without mirroring, leaving messages vulnerable to loss during node failures.
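Why a prefetch count of 1 limits throughput can be illustrated with a simple window model. This is a rough sketch and not a measurement from the engagement: the function name and the timing figures in the usage note are hypothetical, and the model assumes one consumer that acknowledges each message after processing it.

```python
def max_consumer_rate(prefetch: int, processing_s: float, round_trip_s: float) -> float:
    """Upper bound on messages per second for one consumer (simplified model)."""
    if prefetch < 1:
        raise ValueError("prefetch must be at least 1")
    if processing_s <= 0 or round_trip_s < 0:
        raise ValueError("processing_s must be > 0 and round_trip_s must be >= 0")

    # The broker allows at most `prefetch` unacknowledged messages in flight,
    # so at most `prefetch` messages complete per (processing + round trip) window.
    window_limited = prefetch / (processing_s + round_trip_s)

    # The consumer can never go faster than one message per processing interval.
    processing_limited = 1.0 / processing_s

    return min(window_limited, processing_limited)
```

With hypothetical figures of 2 ms processing time and a 10 ms acknowledgement round trip, `max_consumer_rate(1, 0.002, 0.010)` gives roughly 83 messages per second, while `max_consumer_rate(20, 0.002, 0.010)` is capped by processing time at roughly 500. Under this model, a prefetch of 1 leaves the consumer idle for most of each round trip.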

Environment

Kubernetes deployment, RabbitMQ handling 300,000+ device connections, Spring Boot consumers, plans for horizontal scaling with Prometheus/Grafana monitoring.

Approach

AceMQ conducted a multi-day intensive remediation with live environment review, real-time configuration changes, and hands-on performance optimization. The team analyzed metrics, reviewed topology, and provided detailed architectural recommendations for scaling.

Solution

Recommended a horizontal scaling strategy (more nodes) instead of vertical scaling
Increased the consumer prefetch count from 1 to 20 to improve consumer throughput
Configured lazy queues, which write messages directly to disk, to reduce producer blocking
Implemented per-queue memory limits and TTL policies to prevent OOM crashes
Raised the memory high watermark from 0.7 to 0.9 for better memory utilization
Designed a migration path from classic queues to quorum queues for high availability
Provided an Excel-based server requirements calculator for capacity planning
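The queue-level items above (lazy mode, size limits, TTL) are typically applied in RabbitMQ as a policy. The sketch below builds a `rabbitmqctl set_policy` command for such a policy without executing it. The policy name, queue pattern, and limit values are hypothetical, and using `max-length-bytes` as the per-queue limit is an assumption; the case study does not say which mechanism was used.

```python
import json


def build_set_policy_command(name, pattern, message_ttl_ms, max_length_bytes,
                             lazy=True, priority=0):
    """Return the argument list for a `rabbitmqctl set_policy` call (not executed)."""
    definition = {
        # Expire messages that sit in the queue longer than this many milliseconds.
        "message-ttl": message_ttl_ms,
        # Cap the total size of ready message bodies held by each matching queue.
        "max-length-bytes": max_length_bytes,
    }
    if lazy:
        # Lazy mode applies to classic queues; newer RabbitMQ releases may ignore
        # this key, so check the documentation for the version in use.
        definition["queue-mode"] = "lazy"

    return [
        "rabbitmqctl", "set_policy",
        "--apply-to", "queues",
        "--priority", str(priority),
        name, pattern, json.dumps(definition),
    ]


if __name__ == "__main__":
    # Hypothetical values: 60 s TTL and a 100 MiB cap on queues named "device.*".
    cmd = build_set_policy_command("device-queues", r"^device\.", 60_000, 100 * 1024 * 1024)
    print(" ".join(cmd))
```

Setting these values through a policy rather than as queue arguments in application code means they can be changed later without redeclaring queues.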

Outcome

The client eliminated weekly RabbitMQ crashes, achieved significantly improved consumer throughput, and has a clear scaling roadmap to support growing device counts. The platform now handles 300,000+ devices without stability issues.
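A scaling roadmap of this kind rests on a sizing calculation. The following toy model is not the calculator delivered to the client, whose contents are not described here. It counts only connection memory, and the per-connection footprint and node size in the usage note are hypothetical placeholders that would need to be measured.

```python
import math


def nodes_needed(connections, bytes_per_connection, node_memory_bytes,
                 high_watermark, spare_nodes=0):
    """Estimate node count from connection memory alone (deliberately simplified)."""
    if not 0 < high_watermark <= 1:
        raise ValueError("high_watermark must be in (0, 1]")

    # Memory the broker may use before the high watermark blocks publishers.
    usable_bytes = node_memory_bytes * high_watermark

    # Whole connections that fit on one node under that budget.
    per_node = int(usable_bytes // bytes_per_connection)
    if per_node < 1:
        raise ValueError("a single connection does not fit in the usable memory")

    # Round up, then add any spare nodes kept for failover headroom.
    return math.ceil(connections / per_node) + spare_nodes
```

Assuming 100 KiB per connection and 8 GiB nodes, `nodes_needed(300_000, 100 * 1024, 8 * 1024**3, 0.9)` returns 4, while the same call with a watermark of 0.7 returns 6. Queue contents, the operating system, and other processes also consume memory, so a real estimate would come out higher.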

Technologies

RabbitMQ, Kubernetes, Quorum Queues, Prometheus, Grafana
