Module A-15 · 18 min read

The decision framework: standalone for development and tolerable restart downtime, Sentinel for automatic failover with a single-node dataset, Cluster for horizontal scaling past single-node RAM. Operational cost of each and when managed Redis wins.

A-15 — Topology Decision Tree: Standalone, Sentinel, or Cluster

Q: Your application requires an in-memory cache to store API responses. The data is entirely ephemeral; if the cache goes down, the application will simply fall back to querying the primary PostgreSQL database until the cache recovers. The total dataset will never exceed 10GB, and the write volume is moderate. Which Redis topology is the most appropriate and cost-effective choice?

A: A standalone Redis instance with persistence disabled, manually restarted if it fails. This scenario describes a pure cache where data loss is completely acceptable. The application is already designed to handle cache misses by falling back to the database. Introducing Sentinel or Cluster adds significant operational complexity (monitoring, configuration, infrastructure costs) to solve a problem (high availability) that the application architecture has already mitigated. A standalone instance is simple, cheap, and perfectly suited for disposable data.

Q: Your engineering team is building a high-traffic e-commerce platform. Redis is used to store user sessions and a distributed job queue. It is critical that Redis is highly available (automatic failover is required). The total dataset is projected to be 40GB. The application frequently uses complex Lua scripts that atomically modify multiple keys across different logical namespaces. Which topology should you choose?

A: Redis Sentinel with a primary and read replicas. The key constraints here are: automatic failover is required, the dataset fits comfortably in a single machine's RAM (a 40GB dataset on an instance with 80GB+ of RAM is fine), and the application relies heavily on unrestricted multi-key operations (Lua scripts across namespaces). Redis Cluster explicitly restricts multi-key operations to keys residing in the same hash slot. While Cluster is technically possible, rewriting complex Lua scripts to use hash tags introduces significant engineering friction and complexity when the dataset *doesn't actually require* horizontal sharding. Sentinel provides the required high availability while retaining full, unrestricted command support.

Q: What is the primary operational trade-off you must accept when moving from a Redis Sentinel architecture to a Redis Cluster architecture?

A: You gain horizontal write scaling and dataset capacity beyond a single machine's RAM, but you take on significantly higher operational complexity (managing slots, resharding, cross-slot command restrictions, and gossip protocol monitoring). Redis Cluster solves two specific problems: datasets larger than a single machine's RAM, and write workloads that exceed a single node's CPU capacity. However, the cost of solving these problems is steep. You introduce the complexity of slot management, the engineering constraint of hash tags for multi-key operations, and the operational burden of monitoring a complex distributed system (gossip convergence, partial failures). It is a necessary trade-off for massive scale, but an unnecessary burden if Sentinel meets your needs.

This is the final module of the Architect tier. Every module in this course has built toward this decision. You now have the vocabulary, the mental models, and the production experience to answer the question every Redis architect eventually faces: what topology should I deploy? This module synthesises the course into a practical decision framework — not a flowchart, but a reasoned guide based on your actual requirements.

The Three Topologies

Standalone Redis

A single Redis primary with no high-availability configuration. The simplest deployment.

App → Redis Primary

What you get:

Zero operational complexity
Full command support (no Cluster restrictions)
Up to ~1M ops/sec on modern hardware
Dataset limited to one machine's RAM

What you don't get:

Automatic failover (Redis down = application down until manual recovery)
Data distribution beyond one node

Redis Sentinel

A primary with one or more replicas, monitored by 3+ Sentinel processes. Automatic failover on primary failure.

App → (Sentinel-discovered) Redis Primary ← Replicas
         ↑
    Sentinel × 3

What you get:

Automatic failover (10–30 seconds of write downtime on primary failure)
Read scaling via replicas
Full command support (no Cluster restrictions on multi-key ops)
Dataset limited to one primary's RAM

What you don't get:

Horizontal write scaling
Dataset scaling beyond one machine's RAM
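
The "3+ Sentinel processes" requirement follows from majority voting: a failover has to be authorized by a majority of the Sentinel processes, so the count determines how many Sentinels can be lost before automatic failover stops working. A minimal sketch of that arithmetic (the function names are illustrative, not part of any Redis client):

```python
def sentinel_majority(num_sentinels: int) -> int:
    """Number of Sentinels that must agree to authorize a failover."""
    if num_sentinels < 1:
        raise ValueError("need at least one Sentinel")
    return num_sentinels // 2 + 1


def sentinel_failures_tolerated(num_sentinels: int) -> int:
    """Sentinel processes that can be lost while a majority remains."""
    return num_sentinels - sentinel_majority(num_sentinels)


for n in (1, 2, 3, 5):
    print(n, sentinel_majority(n), sentinel_failures_tolerated(n))
```

With two Sentinels, losing either one leaves no majority, which is why three is the practical minimum.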

Redis Cluster

Data sharded across multiple primary nodes, each with replicas. Built-in failover.

App → Cluster-aware client
         ↓
  ┌─────────┬─────────┬─────────┐
  │ Node 1  │ Node 2  │ Node 3  │  (primaries)
  │ slots   │ slots   │ slots   │
  │ 0-5460  │ 5461-   │ 10923-  │
  │         │ 10922   │ 16383   │
  └─────────┴─────────┴─────────┘
  Replicas for each primary

What you get:

Horizontal write scaling (add nodes to add capacity)
Dataset scaling beyond single-node RAM
Built-in automatic failover per shard

What you don't get:

Unrestricted multi-key operations (keys on different slots require hash tags)
SELECT for multiple databases (database 0 only)
Simplicity — significantly more operational complexity
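
To make the hash-slot restriction concrete, here is a small sketch of how a key maps to one of the 16384 slots and how a hash tag changes that mapping. It assumes the rule from the Redis Cluster specification (CRC16/XMODEM of the key, or of the first non-empty {...} section, modulo 16384) and uses Python's standard binascii.crc_hqx for the checksum. The function names hash_slot and same_slot and the example key names are illustrative only.

```python
from binascii import crc_hqx

NUM_SLOTS = 16384


def hash_slot(key: str) -> int:
    """Slot for a key, hashing only the hash tag when one is present."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" means the whole key is hashed.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # crc_hqx with initial value 0 is the CRC16/XMODEM variant.
    return crc_hqx(data, 0) % NUM_SLOTS


def same_slot(*keys: str) -> bool:
    """True when a multi-key command on these keys stays in one slot."""
    return len({hash_slot(k) for k in keys}) == 1


# Keys sharing the tag {user:1000} are co-located by construction.
print(same_slot("{user:1000}:profile", "{user:1000}:cart"))
# Untagged keys land wherever their full name hashes to.
print(hash_slot("user:1000:profile"), hash_slot("user:1000:cart"))
```

A check like same_slot can be run over the key sets your Lua scripts and transactions touch to estimate how much renaming a move to Cluster would require.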

The Decision Framework

Work through these questions in order. Stop when you find your answer.

Question 1: Is Redis purely a cache where data loss is acceptable?

Yes → Use Standalone with no persistence.

The simplest deployment. If Redis goes down, your application falls back to the database, the cache re-populates on the next request, and you move on. The cost of Redis downtime is slower application response, not data loss.

This covers the majority of Redis use cases. Most Redis deployments are caches.
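
The fallback behaviour described above can be sketched as a cache-aside read that treats a cache outage as a miss. This is a minimal illustration, not a production client: the cache and database are passed in as plain callables, all names are hypothetical, and the built-in ConnectionError stands in for whatever connection exception your Redis client actually raises.

```python
from typing import Callable, Optional


def get_with_fallback(
    key: str,
    cache_get: Callable[[str], Optional[str]],
    cache_set: Callable[[str, str], None],
    load_from_db: Callable[[str], str],
) -> str:
    """Read through the cache; fall back to the database if the cache is down."""
    try:
        cached = cache_get(key)
        if cached is not None:
            return cached
    except ConnectionError:
        # Cache unavailable: serve from the database, skip the write-back.
        return load_from_db(key)

    value = load_from_db(key)
    try:
        cache_set(key, value)  # re-populate for the next request
    except ConnectionError:
        pass  # losing a cache write is acceptable for disposable data
    return value


# Usage with an in-memory dict standing in for the cache.
store: dict = {}


def db(key: str) -> str:
    return f"row-for-{key}"


def broken_get(key: str) -> Optional[str]:
    raise ConnectionError("cache unavailable")


print(get_with_fallback("a", store.get, store.__setitem__, db))
print(get_with_fallback("b", broken_get, store.__setitem__, db))
```

Both calls return the database value; only the first one leaves an entry in the cache.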

Question 2: Does your dataset fit in one machine's RAM?

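Worst case: 64GB of application data → up to 64GB of extra copy-on-write pages if every page is modified while BGSAVE runs → a machine with ≥ 128GB RAM is safe under any write load. A replica holds its own 64GB copy, normally on a separate machine, and needs the same headroom.

A rough formula for typical write rates: machine RAM ≥ 1.5× dataset size (to handle BGSAVE CoW overhead).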
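
The sizing rule can be written as a small helper. The function names are hypothetical; the default factor of 1.5 is the rough formula above, and a factor of 2.0 reproduces the 64GB → 128GB figure.

```python
def required_ram_gb(dataset_gb: float, headroom_factor: float = 1.5) -> float:
    """Machine RAM needed for a dataset, including BGSAVE copy-on-write headroom."""
    if dataset_gb < 0 or headroom_factor < 1.0:
        raise ValueError("dataset must be >= 0 and headroom factor >= 1.0")
    return dataset_gb * headroom_factor


def fits_on_machine(dataset_gb: float, machine_ram_gb: float,
                    headroom_factor: float = 1.5) -> bool:
    """Answer to Question 2 for a given machine size."""
    return required_ram_gb(dataset_gb, headroom_factor) <= machine_ram_gb


print(required_ram_gb(64))        # 96.0  (rough formula)
print(required_ram_gb(64, 2.0))   # 128.0 (conservative sizing)
print(fits_on_machine(40, 80))    # True
```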

Yes, it fits → continue to Question 3.
No, it does not fit → strong signal for Cluster (skip to Question 6).

Question 3: Can you tolerate manual failover?

Manual failover means: primary crashes → you get paged → you promote a replica with REPLICAOF NO ONE → you update application config → application reconnects. Typical duration: 5–30 minutes depending on on-call response time.

Yes, manual failover is acceptable (development, staging, non-critical production) → Standalone with a manually-managed replica for backups.

No, you need automatic failover → continue to Question 4.

Question 4: Do you need unrestricted multi-key operations?

If your application relies on MULTI/EXEC across arbitrary keys, Lua scripts accessing keys from different hash slots, or complex MGET/MSET patterns — Cluster's hash-slot restriction is a significant constraint.

Yes, you need unrestricted multi-key ops → Sentinel.

Sentinel gives you automatic failover with full command support and no Cluster restrictions.

No, you can work with hash tags and single-slot constraints → Sentinel is still the simpler choice unless you also need horizontal scaling; continue to Question 5.

Question 5: Do you need horizontal write scaling?

Can your write workload be handled by a single Redis node (up to ~500K writes/second)?

Yes, single node is sufficient → Sentinel.

Sentinel handles automatic failover, read scaling via replicas, and full command support. It is simpler to operate than Cluster and appropriate for the vast majority of production Redis deployments.

No, you need to distribute writes across multiple nodes → Cluster.

Question 6 (Cluster path): Can your application keys be co-located with hash tags?

If you need horizontal scaling and are considering Cluster, ask:

Can all related keys be accessed in the same hash slot via {tag} naming?

If your application uses Lua scripts or transactions that span multiple unrelated key namespaces, Cluster's cross-slot restriction may require significant refactoring.

Yes, you can adopt hash tags → Cluster — proceed with hash tag design before implementing.

No, cross-slot operations are fundamental → Consider:

Refactoring to use Cluster-compatible patterns (worth it if you genuinely need horizontal scale)
Managed Redis with larger node sizes (delay Cluster adoption)
A different distributed cache architecture

Decision Summary Table

Requirement	Topology	Notes
Pure cache, data loss OK	Standalone	No persistence needed
Auth data, tolerate manual failover	Standalone + replica	Manual promotion procedure
Auto-failover, fits in single node RAM	Sentinel	Most production Redis deployments
Auto-failover, read scaling	Sentinel + replicas	Standard production setup
Dataset > single node RAM	Cluster	Hash tag design required
Write throughput > 500K/sec	Cluster	Rare — most need Sentinel
Multi-region, low write latency	Redis Enterprise Active-Active	Not OSS Redis
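
For readers who prefer code to prose, the questions above can be condensed into one function. This is a sketch of the framework as presented in this module, evaluated in the same order, and not a substitute for judgment. The Requirements fields, the returned labels, and the reuse of the 1.5× RAM rule for the "fits" check are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    pure_cache: bool            # Question 1: is data loss acceptable?
    dataset_gb: float           # Question 2 inputs
    machine_ram_gb: float
    manual_failover_ok: bool    # Question 3
    needs_write_scaling: bool   # Question 5
    hash_tags_feasible: bool    # Question 6


def choose_topology(req: Requirements) -> str:
    """Walk the questions in order and stop at the first answer."""
    if req.pure_cache:
        return "standalone (no persistence)"

    fits = req.dataset_gb * 1.5 <= req.machine_ram_gb
    if fits and req.manual_failover_ok:
        return "standalone + replica"
    if fits and not req.needs_write_scaling:
        # Question 4 does not change this outcome: Sentinel either way.
        return "sentinel"

    # Cluster path: dataset too large or writes exceed one node.
    if req.hash_tags_feasible:
        return "cluster"
    return "reconsider: refactor keys, larger managed node, or another architecture"


# The e-commerce scenario: 40GB, automatic failover, cross-namespace Lua scripts.
print(choose_topology(Requirements(False, 40, 80, False, False, False)))
```

The multi-region row of the table is deliberately left out, since it falls outside OSS Redis.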

The Case for Managed Redis

Before deciding between Sentinel and Cluster, evaluate whether a managed Redis service simplifies your decision:

AWS ElastiCache:

Cluster mode disabled: the Sentinel equivalent, with read replicas and automated Multi-AZ failover in < 60 seconds
Cluster mode enabled: Redis Cluster — add shards for horizontal scaling
Handles all infrastructure, patching, and failover orchestration

Google Cloud Memorystore:

Standard tier: primary + replica, automated failover
Cluster: full Redis Cluster support (Preview as of 2024)

Redis Cloud (Redis Ltd.):

Full Redis Enterprise capabilities — active-active geo-distribution, modules, higher availability SLAs
Premium pricing but managed operations

Recommendation: For most production deployments, use a managed service. The operational cost of running Sentinel or Cluster yourself (monitoring, patching, failover testing, backup management) is significant. Managed services commoditise this work.

The Operational Cost of Cluster

Before choosing Cluster, be honest about the operational overhead:

Monitoring: 3+ primaries + 3+ replicas = 6+ nodes to monitor, alert on, and maintain
Resharding: adding nodes requires careful slot migration planning
Upgrades: rolling upgrades across a cluster require a defined procedure
Debugging: cross-slot errors, gossip convergence delays, partial cluster failures
Development parity: developers need a local Cluster setup or a proxy that simulates Cluster behaviour

For teams without dedicated Redis operational expertise, the jump from Sentinel to Cluster often results in prolonged incidents that a managed service would have handled automatically.

The question is not "can we run Cluster?" but "should we run Cluster, or should we use a managed service that runs it for us?"

Putting It All Together

Here is the honest synthesis of this course:

Redis is an extraordinarily capable tool. The primitives — atomic counters, pub/sub, streams, Lua scripts, sorted sets — are elegant solutions to problems that would require significant infrastructure elsewhere.

Redis is also easy to misuse. The most common mistakes are not in the code — they are architectural: using it as a primary database without understanding durability trade-offs, ignoring eviction under memory pressure, not planning for the coordination state it holds, and treating "it's on Redis" as synonymous with "it's reliable."

The engineers who use Redis most effectively understand two things:

Redis's mental model — that every operation is a data structure manipulation, that memory is the primary resource, that the single-threaded event loop is both its strength and its constraint.
The distributed systems context — that Redis is one component in a larger system, that its persistence guarantees are weaker than a database, that replication lag is real, and that locks are best-effort coordination, not guarantees.

You now have both. Every module in this course has added a layer to this understanding. Use it well.

Course Complete

Foundation tier (F-1 through F-11): Data structures, memory model, expiry, eviction, HyperLogLog, Bitmaps, Geo, pipelining, Pub/Sub, Streams, memory internals, transactions, caching patterns.

Practitioner tier (P-1 through P-13): RDB persistence, AOF persistence, persistence decision framework, memory profiling, atomic operations, BullMQ internals, cache failure modes, keyspace notifications, session management, connection pooling, monitoring, security, the event loop.

Architect tier (A-1 through A-15): Distributed locking, Lua scripting, Redis Functions, Redlock, advanced lock patterns, the SupraScan production case study, replication, Sentinel, Cluster, resharding, gossip, multi-region, disaster recovery, performance tuning, topology decisions.

What comes next for you:

Apply these concepts to a real system under load — theory becomes intuition through production experience
Contribute to open-source Redis tooling (the ecosystem of client libraries, monitoring exporters, and operational tools)
Read the Redis source code — it is well-written C and the best documentation of the internals
Follow antirez (Redis creator) and the Redis maintainers' writeups — the design decisions behind each feature are illuminating

Redis rewards deep understanding. You have the foundation. Now build something with it.
