Module A-8 · 22 min read

Sentinel as an independent high-availability process, subjective down vs objective down, the failover election sequence, min-replicas-to-write for split-brain prevention, and what Sentinel cannot protect against.

A-8 — Redis Sentinel: Quorum, Failover, and Split-Brain Prevention

Q: Why is it recommended to deploy an odd number of Sentinel instances (e.g., 3 instead of 2) in a Redis high-availability architecture?

To ensure that a strict majority can still be reached during a network partition, so that two isolated Sentinels cannot each authorize a different failover. Sentinel uses quorum-based agreement to declare an Objective Down (ODOWN) state and a Raft-like majority vote to elect a failover leader. With 2 Sentinels, a majority is 2 of 2, so if they lose contact with each other (or one of them fails), each survivor holds only 1 of 2 votes and no failover can be authorized. With 3 Sentinels, if a partition isolates one of them, the other two can still communicate, form a majority (2 of 3), agree the primary is down, and safely execute the failover. A fourth Sentinel adds no extra tolerance (the majority becomes 3 of 4), which is why odd counts are the norm.

Q: In a typical Node.js application using `ioredis` configured with Sentinel, how does the application handle a primary node failure?

The Sentinel-aware client library asks the Sentinels for the current primary address; when its connection to the failed primary drops (or, if so configured, when it receives a `+switch-master` event from the Sentinels' pub/sub), it queries the Sentinels again and reconnects to the newly promoted primary. Sentinel is not a proxy. Clients connect directly to the Redis nodes. However, a Sentinel-aware client (like `ioredis` when initialized with the `sentinels: []` array) handles the failover orchestration behind the scenes. It discovers the current primary from the Sentinels on startup, re-resolves the primary after a disconnect or a failover notification, and repoints its connection to the new primary, making the failover almost transparent to the application code. Note that `__sentinel__:hello` is the channel Sentinels use to discover one another; failover events are published on the `+switch-master` channel.

Q: A network partition isolates the Redis primary and the application servers from the Sentinels and replicas. The application continues writing to the isolated primary for 5 minutes. The Sentinels, unable to reach the primary, promote a replica. When the network partition heals, what happens to the 5 minutes of data written to the old primary?

The old primary is reconfigured as a replica of the new primary, resynchronizes from it, and permanently loses all the writes it accepted during the partition. This is the classic "split-brain" data loss scenario. Because Redis prioritizes availability over strict consistency, the old primary cheerfully accepted writes while isolated. When the partition heals, Sentinel demotes the old primary and makes it a replica of the newly promoted primary. To synchronize, the demoted node discards its own diverged dataset and loads the new primary's RDB snapshot, so all writes made to the isolated primary are permanently destroyed. This is why configuring `min-replicas-to-write` is critical: it makes an isolated primary stop accepting writes once its replicas stop acknowledging, which bounds the loss to a few seconds instead of minutes.

Who this module is for: You want Redis to automatically recover from a primary failure — promoting a replica and reconfiguring clients — without manual intervention. Redis Sentinel is the high-availability solution for single-node (non-Cluster) Redis deployments. This module covers the Sentinel model, the failover sequence, split-brain prevention, and Sentinel's limitations.

What Sentinel Is

Redis Sentinel is a separate process (not part of Redis itself) that monitors a Redis primary and its replicas. When the primary fails, Sentinel:

Detects the failure (agrees with other Sentinels that the primary is down)
Elects a leader Sentinel to run the failover
Selects the best replica to promote
Promotes the chosen replica to primary
Configures other replicas to replicate from the new primary
Notifies clients of the new primary address

Sentinel is not a proxy — it does not route traffic. It is a monitoring and orchestration layer that clients query to discover the current primary address.

Deployment Topology

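A minimal robust Sentinel setup uses 3 Sentinel processes on separate machines. An odd number is recommended rather than strictly required: a failover must be authorized by a majority of Sentinels, and an even count adds a process without adding failure tolerance.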

text

Machine 1: Redis Primary + Sentinel 1
Machine 2: Redis Replica + Sentinel 2
Machine 3: Redis Replica + Sentinel 3
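The reason three is the usual minimum can be checked with a little arithmetic. The sketch below is illustrative only; the function names `majority` and `toleratedFailures` are made up for this example and are not part of Redis or any client library.

```typescript
// Majority needed to authorize a failover among n known Sentinels.
function majority(n: number): number {
  return Math.floor(n / 2) + 1;
}

// How many Sentinels can be lost (crashed or partitioned away)
// while the remaining ones can still form a majority.
function toleratedFailures(n: number): number {
  return n - majority(n);
}

for (const n of [2, 3, 4, 5]) {
  console.log(
    `${n} Sentinels: majority=${majority(n)}, tolerated failures=${toleratedFailures(n)}`,
  );
}
```

For 2, 3, 4, and 5 Sentinels this gives 0, 1, 1, and 2 tolerated failures: two Sentinels tolerate nothing, and a fourth buys no more than three already provide.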

Clients connect to any Sentinel to discover the current primary's IP and port. They then connect directly to the primary for all operations.
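A rough sketch of that discovery step follows. It assumes a hypothetical `ask` function that sends `SENTINEL get-master-addr-by-name` to one Sentinel and returns the raw `[ip, port]` reply; the types and names here are illustrative, and a real Sentinel-aware client library already does this for you.

```typescript
type HostPort = { host: string; port: number };

// Hypothetical transport (an assumption of this sketch): resolves with the raw
// [ip, port] reply, or null when that Sentinel does not know the primary name.
type AskSentinel = (
  sentinel: HostPort,
  primaryName: string,
) => Promise<[string, string] | null>;

// Try each Sentinel in turn and return the first usable answer.
async function resolvePrimary(
  sentinels: HostPort[],
  primaryName: string,
  ask: AskSentinel,
): Promise<HostPort> {
  for (const sentinel of sentinels) {
    try {
      const reply = await ask(sentinel, primaryName);
      if (reply !== null) {
        return { host: reply[0], port: Number(reply[1]) };
      }
    } catch {
      // This Sentinel is unreachable; move on to the next one.
    }
  }
  throw new Error(`no Sentinel could resolve primary "${primaryName}"`);
}
```

Listing every Sentinel, rather than one, is what lets discovery survive the loss of the machine that hosts both a Sentinel and the failed primary.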

Sentinel Configuration

text

# sentinel.conf (same format for all Sentinel instances)
sentinel monitor mymaster 10.0.1.50 6379 2
# Name:     mymaster
# Primary:  10.0.1.50:6379
# Quorum:   2 (how many Sentinels must agree the primary is down to mark it ODOWN;
#           the failover itself must also be authorized by a majority of Sentinels)

sentinel auth-pass mymaster your-primary-password
sentinel down-after-milliseconds mymaster 5000
# Mark primary as "subjectively down" if unreachable for 5 seconds

sentinel failover-timeout mymaster 60000
# Maximum time to complete a failover (60 seconds)

sentinel parallel-syncs mymaster 1
# How many replicas can sync from new primary simultaneously during failover
# 1 = replicas sync one at a time (slower failover but less load spike)

The Failure Detection Sequence

Step 1: Subjective Down (SDOWN)

A single Sentinel marks the primary as "subjectively down" if it cannot reach the primary within down-after-milliseconds. This is one Sentinel's opinion — a network blip between just that Sentinel and the primary would cause an SDOWN that does not represent a real failure.

text

Sentinel 1: PING to primary... timeout (5 seconds)
Sentinel 1: Primary is SDOWN (subjectively down — my opinion only)
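As a minimal sketch, one Sentinel's SDOWN decision boils down to a timestamp comparison. The function and parameter names below are hypothetical; the real check lives inside the Sentinel process.

```typescript
// One Sentinel's local view: the primary is SDOWN when no valid reply
// has been seen for longer than down-after-milliseconds.
function isSubjectivelyDown(
  lastValidReplyAtMs: number,
  nowMs: number,
  downAfterMs: number,
): boolean {
  return nowMs - lastValidReplyAtMs > downAfterMs;
}

// With down-after-milliseconds 5000: silent for 3 s is fine, silent for 6 s is SDOWN.
console.log(isSubjectivelyDown(0, 3000, 5000));
console.log(isSubjectivelyDown(0, 6000, 5000));
```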

Step 2: Objective Down (ODOWN)

A Sentinel queries other Sentinels: "Do you also think the primary is down?" If at least quorum Sentinels agree, the primary is declared "objectively down" (ODOWN) — a real failure.

text

Sentinel 1 → Sentinel 2: "Is mymaster down?" → Yes (SDOWN)
Sentinel 1 → Sentinel 3: "Is mymaster down?" → Yes (SDOWN)
Sentinel 1: Quorum reached (2/3) → primary is ODOWN

With quorum 2: at least 2 of 3 Sentinels must agree. This prevents a single Sentinel's network issue from triggering an unnecessary failover.
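The quorum check itself can be sketched as a simple count. This is an illustration with made-up types, not Sentinel's actual code; the list of opinions is assumed to include the asking Sentinel's own view.

```typescript
type SentinelOpinion = { sentinel: string; primaryDown: boolean };

// ODOWN is reached when at least `quorum` Sentinels report the primary as down.
function isObjectivelyDown(opinions: SentinelOpinion[], quorum: number): boolean {
  const downVotes = opinions.filter((o) => o.primaryDown).length;
  return downVotes >= quorum;
}

// Only Sentinel 1 has lost sight of the primary: SDOWN for it, but no ODOWN.
const blip: SentinelOpinion[] = [
  { sentinel: "sentinel-1", primaryDown: true },
  { sentinel: "sentinel-2", primaryDown: false },
  { sentinel: "sentinel-3", primaryDown: false },
];
console.log(isObjectivelyDown(blip, 2));
```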

Step 3: Leader Election

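One Sentinel must be elected to lead the failover. Sentinel uses a Raft-like election: a Sentinel that sees ODOWN asks the others for their vote in the current epoch, and each Sentinel grants at most one vote per epoch. A candidate that collects votes from a majority of all known Sentinels (and from at least quorum of them) becomes the failover leader.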
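The winning condition is stricter than the ODOWN quorum: the candidate needs a majority of all known Sentinels, and never fewer votes than the configured quorum. A small sketch, with a hypothetical function name:

```typescript
// A candidate may lead the failover only with votes from a majority of all
// known Sentinels, and never with fewer votes than the configured quorum.
function canLeadFailover(votes: number, knownSentinels: number, quorum: number): boolean {
  const majority = Math.floor(knownSentinels / 2) + 1;
  return votes >= Math.max(majority, quorum);
}

console.log(canLeadFailover(2, 3, 2)); // 2 of 3 votes, quorum 2
console.log(canLeadFailover(1, 3, 1)); // quorum 1 reaches ODOWN, but 1 vote is no majority
```

The second call shows why a low quorum does not make failover unsafe: ODOWN can be declared by one Sentinel, yet the failover still cannot start without a majority.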

Step 4: Replica Selection

The leader Sentinel chooses which replica to promote. Selection criteria (in order of preference):

Replica with the lowest replica-priority (set in the replica's redis.conf; formerly called slave-priority)
Replica with the highest replication offset (it has processed the most data from the old primary)
Replica with the smallest Run ID (lexicographically) as tiebreaker

text

# Replica's redis.conf: a lower replica-priority is preferred for promotion
# (the directive was formerly named slave-priority); 100 is the default
replica-priority 100
# replica-priority 0 means "never promote this replica" (e.g., replica used for backups)
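The ordering above can be sketched as a filter followed by a three-key sort. The `ReplicaInfo` shape and `selectReplica` name are assumptions made for this example, and the sketch leaves out the health checks Sentinel applies first (for instance, skipping replicas that are disconnected).

```typescript
type ReplicaInfo = {
  runId: string;
  priority: number; // replica-priority; 0 means "never promote"
  replOffset: number; // replication offset processed; higher means more up to date
};

function selectReplica(replicas: ReplicaInfo[]): ReplicaInfo | undefined {
  // filter() returns a new array, so the caller's list is not reordered.
  const candidates = replicas.filter((r) => r.priority !== 0);
  candidates.sort(
    (a, b) =>
      a.priority - b.priority || // 1. lowest priority value first
      b.replOffset - a.replOffset || // 2. then highest replication offset
      (a.runId < b.runId ? -1 : a.runId > b.runId ? 1 : 0), // 3. then smallest run ID
  );
  return candidates[0];
}
```

A backup replica configured with priority 0 is never returned, even if it holds the most recent data.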

Step 5: Failover Execution

text

1. Leader Sentinel sends REPLICAOF NO ONE to the chosen replica → it becomes primary
2. Leader Sentinel configures remaining replicas: REPLICAOF {new-primary-ip} {port}
3. Leader Sentinel updates its own configuration with the new primary address
4. Other Sentinels update their configuration
5. Sentinel publishes a +switch-master event on its own Pub/Sub interface (channel +switch-master)
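The effect of steps 1 and 2 on the topology can be modeled as a pure transformation. This is only a toy model with hypothetical names (`RedisNode`, `applyFailover`); it ignores timing, parallel-syncs, and the fact that the old primary is reconfigured only once it becomes reachable again.

```typescript
type RedisNode = { addr: string; role: "primary" | "replica"; replicaOf?: string };

// After the failover, the promoted node is the primary and every other node
// (including the old primary, once it is back) replicates from it.
function applyFailover(nodes: RedisNode[], promotedAddr: string): RedisNode[] {
  if (!nodes.some((n) => n.addr === promotedAddr)) {
    throw new Error(`unknown node: ${promotedAddr}`);
  }
  return nodes.map((n): RedisNode =>
    n.addr === promotedAddr
      ? { addr: n.addr, role: "primary" }
      : { addr: n.addr, role: "replica", replicaOf: promotedAddr },
  );
}
```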

Step 6: Client Notification

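Clients that use a Sentinel-aware client library (ioredis with sentinels config, Jedis with Sentinel support, etc.) subscribe to the +switch-master channel on the Sentinels, periodically query Sentinel, or query it again when their connection drops. When +switch-master fires, the client reconnects to the new primary address. (The __sentinel__:hello channel is something different: Sentinels use it to discover one another.)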
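The payload of a +switch-master message has the form `<master name> <old ip> <old port> <new ip> <new port>`. A minimal parsing sketch follows; the `SwitchMaster` type and function name are invented for this example, and a Sentinel-aware client library normally handles this internally.

```typescript
type SwitchMaster = {
  masterName: string;
  oldPrimary: { host: string; port: number };
  newPrimary: { host: string; port: number };
};

function parseSwitchMaster(payload: string): SwitchMaster {
  const parts = payload.trim().split(/\s+/);
  if (parts.length !== 5) {
    throw new Error(`unexpected +switch-master payload: "${payload}"`);
  }
  // The length check above guarantees exactly five fields.
  const [masterName, oldHost, oldPort, newHost, newPort] = parts as [
    string, string, string, string, string,
  ];
  return {
    masterName,
    oldPrimary: { host: oldHost, port: Number(oldPort) },
    newPrimary: { host: newHost, port: Number(newPort) },
  };
}

const switchEvent = parseSwitchMaster("mymaster 10.0.1.50 6379 10.0.1.51 6379");
console.log(`reconnect to ${switchEvent.newPrimary.host}:${switchEvent.newPrimary.port}`);
```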

Failover Duration

A typical Sentinel failover takes:

down-after-milliseconds to detect SDOWN (5 seconds in the configuration above; the Redis default is 30 seconds)
~1 second for ODOWN consensus
~1 second for leader election
~2–5 seconds for replica promotion and reconfiguration

Total: ~10 seconds of write downtime with the configuration above (down-after-milliseconds 5000).

To reduce failover time: lower down-after-milliseconds (at the risk of false positives from brief network blips).
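To see how much down-after-milliseconds dominates the total, the phases can simply be added up. The sketch below uses this module's rough per-phase figures as inputs; they are estimates, not measurements, and the type and function names are hypothetical.

```typescript
type FailoverPhasesMs = {
  downAfterMs: number; // down-after-milliseconds (SDOWN detection)
  odownMs: number; // ODOWN consensus
  electionMs: number; // leader election
  promotionMs: number; // replica promotion and reconfiguration
};

function estimateWriteDowntimeMs(p: FailoverPhasesMs): number {
  return p.downAfterMs + p.odownMs + p.electionMs + p.promotionMs;
}

const roughPhases = { odownMs: 1000, electionMs: 1000, promotionMs: 2000 };
// 5000 ms detection (this module's sentinel.conf) vs. 30000 ms.
console.log(estimateWriteDowntimeMs({ downAfterMs: 5000, ...roughPhases }));
console.log(estimateWriteDowntimeMs({ downAfterMs: 30000, ...roughPhases }));
```

With the low-end promotion estimate, the two calls give 9000 ms and 34000 ms: detection time is by far the largest term, so it is the setting to tune.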

Split-Brain Prevention with min-replicas-to-write

Consider a network partition that isolates the primary from Sentinels but not from some clients:

text

Before partition:
  [Primary] ← clients → [App servers]
      ↕ replication
  [Replica 1][Replica 2]
  [Sentinel 1][Sentinel 2][Sentinel 3]

During partition:
  [Primary] ← clients → [App servers]  ← isolated from Sentinels and replicas

  [Replica 1][Replica 2]
  [Sentinel 1][Sentinel 2][Sentinel 3]

Sentinels cannot reach the old primary → ODOWN → failover → Replica 1 promoted to new primary.

Meanwhile, the old primary is still accepting writes from the clients (they can still reach it). When the partition heals, the old primary reconnects as a replica of the new primary and loses all writes it accepted during the partition.

Prevention: min-replicas-to-write and min-replicas-max-lag on the primary:

text

# Primary redis.conf:
min-replicas-to-write 1
min-replicas-max-lag 10

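During the partition, the old primary cannot reach any replica. Once no replica has acknowledged it for more than 10 seconds, it stops accepting writes. Clients receive errors instead of silently losing data; only the writes accepted during those first seconds remain at risk.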
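The primary's decision can be sketched as counting replicas whose last acknowledgement is recent enough. The function below is a simplified model with a hypothetical name, not Redis source code; lag is the number of seconds since each replica last acknowledged the primary.

```typescript
// Writes are accepted only if at least minReplicasToWrite replicas
// acknowledged within the last minReplicasMaxLag seconds.
function primaryAcceptsWrites(
  replicaLagSeconds: number[],
  minReplicasToWrite: number,
  minReplicasMaxLag: number,
): boolean {
  if (minReplicasToWrite === 0) return true; // 0 disables the check
  const healthy = replicaLagSeconds.filter((lag) => lag <= minReplicasMaxLag).length;
  return healthy >= minReplicasToWrite;
}

console.log(primaryAcceptsWrites([1, 2], 1, 10)); // healthy replication
console.log(primaryAcceptsWrites([25, 25], 1, 10)); // partitioned for 25 s
```

In the partition scenario above, every replica's lag keeps growing, so after the max-lag window the isolated primary starts rejecting writes.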

ioredis Sentinel Client

typescript

import Redis from 'ioredis';

const redis = new Redis({
  sentinels: [
    { host: '10.0.1.50', port: 26379 },
    { host: '10.0.1.51', port: 26379 },
    { host: '10.0.1.52', port: 26379 },
  ],
  name: 'mymaster',              // must match sentinel.conf "monitor" name
  password: 'primary-password',
  sentinelPassword: 'sentinel-password',  // if sentinels require AUTH
  role: 'master',                // 'master' or 'slave' (for read-from-replica)
});

// ioredis automatically queries Sentinels to find the current primary
// and, when the connection is lost after a failover, asks them again
// and reconnects to the newly promoted primary

For read replicas:

typescript

const readRedis = new Redis({
  sentinels: [/* ... */],
  name: 'mymaster',
  role: 'slave',   // connects to a random replica
});

Sentinel CLI Commands

bash

# Connect to a Sentinel
redis-cli -h 10.0.1.50 -p 26379

# Query current master
SENTINEL get-master-addr-by-name mymaster
→ 1) "10.0.1.50"
   2) "6379"

# List all monitored masters
SENTINEL masters

# List replicas for a master
SENTINEL replicas mymaster

# List other Sentinels
SENTINEL sentinels mymaster

# Check that quorum and failover authorization can be reached
SENTINEL ckquorum mymaster
→ OK 3 usable Sentinels. Quorum and failover authorization can be reached

# Trigger a manual failover (for testing)
SENTINEL failover mymaster

Sentinel Limitations

What Sentinel is not:

It is not a proxy — clients connect directly to the primary, not through Sentinel
It does not provide horizontal scaling — all writes go to one primary
It does not protect against data loss during the replication lag window
It cannot provide fencing tokens for distributed locking

Sentinel does not prevent data loss: If the primary fails before replication completes, writes in the lag window are lost when a replica is promoted. min-replicas-to-write reduces (but does not eliminate) this window.

Sentinel requires client support: Clients must be Sentinel-aware (know to query Sentinel for the primary address) or connect through a proxy layer (Envoy, Twemproxy) that is kept pointed at the current primary. A client hardcoded to the primary's IP will not automatically reconnect after failover.

Sentinel vs Cluster

Concern	Sentinel	Cluster
Use case	Single-node Redis HA	Horizontal scaling across nodes
Data distribution	All data on one primary	Sharded across 16,384 slots
Write throughput	Limited to one node	Scales with node count
Failover	Automatic (via Sentinel)	Automatic (built-in)
Complexity	Moderate	Higher
Client requirement	Sentinel-aware client	Cluster-aware client
Multi-key ops	Unrestricted	Keys must be on same slot

Use Sentinel when your dataset fits on a single Redis node and you need automatic failover without the operational complexity of Cluster. Use Cluster when you need horizontal scaling beyond what a single node can provide.

Summary

Sentinel is a separate process that monitors Redis primary + replicas and orchestrates automatic failover
SDOWN = one Sentinel's opinion; ODOWN = quorum agreement → triggers failover
Failover sequence: detect ODOWN → elect leader Sentinel → select best replica → promote → reconfigure others → notify clients
Configure quorum 2 with 3 Sentinels — majority agreement prevents false failovers from single-node network issues
min-replicas-to-write 1 + min-replicas-max-lag 10 on the primary limits split-brain data loss during a network partition to roughly the max-lag window
ioredis Sentinel client automatically discovers and reconnects to the new primary after failover
Sentinel does not eliminate data loss — writes in the replication lag window are lost on primary failure
Use Sentinel for single-node HA; use Cluster for horizontal scaling

Next: A-9 — Redis Cluster: Hash Slots and Data Distribution — the 16,384 hash slot model, key routing, MOVED vs ASK redirections, and the constraints multi-key commands impose in a Cluster.
