Module A-1·20 min read

SET key value NX PX as the atomic lock primitive, UUID lock values to prevent accidental release, lock extension with conditional PEXPIRE, the critical GC-pause failure mode, and why distributed locks need fencing tokens.

A-1 — Single-Instance Locking: SET NX PX and Lock Correctness

Q: A junior developer implements a Redis distributed lock using the following two commands: `SETNX lock:resource "holder"` followed immediately by `EXPIRE lock:resource 30`. Why is this implementation fundamentally unsafe in a production environment?

If the application crashes or encounters a network partition between executing the `SETNX` and `EXPIRE` commands, the lock will never expire, resulting in a permanent deadlock for that resource. — Atomicity is the core requirement for acquiring a lock safely. If the operation is split into two separate commands, there is a tiny window of vulnerability. If the client process dies (e.g., OOM killed) exactly after the `SETNX` succeeds but before the `EXPIRE` is sent, the key is left in Redis indefinitely. No other client will ever be able to acquire the lock. The single command `SET lock:resource "value" NX EX 30` guarantees that either the lock is acquired with an expiry, or it is not acquired at all.

Q: Why must a Redis distributed lock be released using a Lua script rather than a simple `DEL` command?

A Lua script ensures that the client only deletes the lock if the value stored in the key exactly matches the unique identifier (e.g., UUID) the client originally set, preventing a slow client from accidentally releasing a lock that has expired and been acquired by another client. — If a client experiences a long pause (like a GC pause) and its lock expires, another client can safely acquire the lock. When the first client wakes up, it believes it still holds the lock and will attempt to release it. If it uses a simple `DEL` command, it will successfully delete the *second* client's lock. By using a Lua script to atomically check `if GET(key) == my_uuid then DEL(key)`, we ensure that a client can only release a lock that it currently holds.

Q: What is the fundamental distributed systems limitation that makes a pure Redis-based lock unsuitable for protecting operations that require strict, absolute exclusion (such as financial deductions where double-spending is catastrophic)?

A process can acquire the lock, pause (e.g., due to garbage collection or CPU scheduling) for longer than the lock's TTL, and resume execution after the lock has expired and been acquired by a second process, leading to both processes executing the critical section simultaneously. — This is the classic "fencing token" problem described by Martin Kleppmann. A Redis lock relies on physical time (TTL). The client application relies on its own perception of time. If the client pauses, physical time continues to pass, the TTL expires, and the lock is handed out again. The paused client wakes up completely unaware that time has passed and proceeds into the critical section, resulting in two active concurrent writers. True mutual exclusion in distributed systems requires the underlying resource being protected (like a database) to reject writes with outdated fencing tokens.

Who this module is for: You need to ensure only one process at a time executes a critical section — a payment deduction, a job processing step, a cache recompute. This module covers the correct Redis single-instance lock primitive, the mistakes that make naive implementations unsafe, and the fundamental limitation that requires fencing tokens for true correctness.

The Lock Primitive: SET NX PX

A Redis distributed lock uses a single key with three properties:

Existence: if the key exists, the lock is held
Identity: the key's value identifies the lock holder (prevents accidental release)
Expiry: the key has a TTL, so it auto-releases if the holder crashes

The atomic primitive that satisfies all three in a single command:

SET lock:resource "lock-value" NX PX 30000

NX — only set if the key does Not eXist (acquire only if nobody holds the lock)
PX 30000 — expire in 30,000 milliseconds (auto-release if holder crashes)

Returns OK if the lock was acquired, nil if already held by another client.

Why this must be a single command: A non-atomic acquire would be:

text

# Wrong (race condition):
EXISTS lock:resource         → 0 (not held)
SET lock:resource "holder"   → sets the lock
EXPIRE lock:resource 30      → sets expiry (EXPIRE takes seconds, not milliseconds)

Between EXISTS and SET, another client can acquire the lock. Between SET and EXPIRE, a crash leaves a permanent lock. The SET key value NX PX single command eliminates both races.
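To make the crash window concrete, here is a minimal in-memory sketch rather than a real Redis client. `FakeRedis` and its method names are hypothetical stand-ins that mimic only the parts of `SETNX`, `PEXPIRE`, `SET ... NX PX`, and `PTTL` needed for the illustration, and the crash is simulated by never sending the second command.

```typescript
// Hypothetical in-memory model; it covers only the behavior this illustration needs.
type Entry = { value: string; expiresAt: number | null };

class FakeRedis {
  private store = new Map<string, Entry>();
  now = 0; // simulated clock in milliseconds

  // Returns the entry if it exists and has not expired.
  private live(key: string): Entry | undefined {
    const e = this.store.get(key);
    if (e && e.expiresAt !== null && e.expiresAt <= this.now) {
      this.store.delete(key); // expired keys disappear
      return undefined;
    }
    return e;
  }

  // SETNX key value: 1 if the key was set, 0 if it already exists
  setnx(key: string, value: string): number {
    if (this.live(key)) return 0;
    this.store.set(key, { value, expiresAt: null });
    return 1;
  }

  // PEXPIRE key ms: 1 if the timeout was set, 0 if the key is missing
  pexpire(key: string, ms: number): number {
    const e = this.live(key);
    if (!e) return 0;
    e.expiresAt = this.now + ms;
    return 1;
  }

  // SET key value NX PX ms: value and expiry are written together or not at all
  setNxPx(key: string, value: string, ms: number): 'OK' | null {
    if (this.live(key)) return null;
    this.store.set(key, { value, expiresAt: this.now + ms });
    return 'OK';
  }

  // PTTL key: -2 if the key is missing, -1 if it has no expiry
  pttl(key: string): number {
    const e = this.live(key);
    if (!e) return -2;
    return e.expiresAt === null ? -1 : e.expiresAt - this.now;
  }
}

// Two-step acquire where the client "crashes" before sending the expiry.
const twoStep = new FakeRedis();
twoStep.setnx('lock:resource', 'holder');
// (crash here: the expiry command is never sent)
twoStep.now += 60000;
const stuckTtl = twoStep.pttl('lock:resource'); // -1: the key never expires

// Atomic acquire: the same crash cannot leave a key without an expiry.
const atomic = new FakeRedis();
atomic.setNxPx('lock:resource', 'holder', 30000);
atomic.now += 60000;
const freedTtl = atomic.pttl('lock:resource'); // -2: the key expired on its own
```

Against a real server, the same check is a `PTTL` on the lock key: a result of -1 on a lock key means it was written without an expiry.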

Lock Value: UUID for Identity

The lock value must uniquely identify the holder. Use a cryptographically random UUID:

typescript

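import { randomUUID } from 'crypto';
import Redis from 'ioredis';

const redis = new Redis();

const lockValue = randomUUID();  // e.g., "f47ac10b-58cc-4372-a567-0e02b2c3d479"
// Resolves to 'OK' if the lock was acquired, null if another client holds it
const acquired = await redis.set('lock:payment:1001', lockValue, 'NX', 'PX', 30000);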

Why identity matters: Without a unique value, any client can release any lock:

text

# Wrong (no identity):
Client A: SET lock:payment ""  → acquires lock
# ... lock expires while A is slow ...
Client B: SET lock:payment ""  → acquires lock (A's lock expired)
Client A: DEL lock:payment     → releases B's lock! A doesn't know it lost the lock.

With a UUID, releasing the lock requires presenting the same UUID that was set:

typescript

// Safe release via Lua (atomic check-and-delete)
const releaseScript = `
  if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
  else
    return 0
  end
`;

async function releaseLock(key: string, lockValue: string): Promise<boolean> {
  const result = await redis.eval(releaseScript, 1, key, lockValue);
  return result === 1;
}

The Lua script atomically checks that the current lock value matches before deleting — if the lock expired and was acquired by another client, the check fails and we do not release their lock.
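The difference between the two release strategies can be replayed without a server. The sketch below is an in-memory stand-in, not Redis: the `keys` map and the `unsafeRelease` and `safeRelease` functions are hypothetical names, and a single synchronous function plays the role that the atomic Lua script plays in Redis.

```typescript
// Hypothetical in-memory stand-in for a single Redis key space.
const keys = new Map<string, string>();

// Unconditional release, equivalent to a bare DEL.
function unsafeRelease(key: string): boolean {
  return keys.delete(key);
}

// Compare-and-delete, mirroring the logic of the Lua release script.
function safeRelease(key: string, lockValue: string): boolean {
  if (keys.get(key) !== lockValue) return false; // not ours, or already gone
  return keys.delete(key);
}

// Client A's lock has expired and client B now holds the key.
keys.set('lock:payment:1001', 'uuid-of-B');

// A wakes up and releases with its own value: nothing is deleted.
const releasedBySafe = safeRelease('lock:payment:1001', 'uuid-of-A'); // false
const stillHeldByB = keys.get('lock:payment:1001') === 'uuid-of-B';   // true

// The same attempt with a bare DEL removes B's lock.
const releasedByUnsafe = unsafeRelease('lock:payment:1001'); // true
const lockGone = !keys.has('lock:payment:1001');             // true
```

A `false` result from the safe release is also useful information: it tells the caller that its lock had already expired by the time it finished.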

Full Lock Implementation

typescript

import { randomUUID } from 'crypto';
import Redis from 'ioredis';

const redis = new Redis();

const RELEASE_SCRIPT = `
  if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
  else
    return 0
  end
`;

interface Lock {
  release: () => Promise<boolean>;
  extend: (ttlMs: number) => Promise<boolean>;
}

async function acquireLock(
  resource: string,
  ttlMs: number,
  retries = 3,
  retryDelayMs = 100
): Promise<Lock | null> {
  const key = `lock:${resource}`;
  const value = randomUUID();

  for (let attempt = 0; attempt <= retries; attempt++) {
    const result = await redis.set(key, value, 'NX', 'PX', ttlMs);

    if (result === 'OK') {
      return {
        release: () => releaseLock(key, value),
        extend: (newTtlMs) => extendLock(key, value, newTtlMs),
      };
    }

    if (attempt < retries) {
      await new Promise(r => setTimeout(r, retryDelayMs + Math.random() * retryDelayMs));
    }
  }

  return null;  // Could not acquire after all retries
}

async function releaseLock(key: string, value: string): Promise<boolean> {
  const result = await redis.eval(RELEASE_SCRIPT, 1, key, value);
  return result === 1;
}

async function extendLock(key: string, value: string, newTtlMs: number): Promise<boolean> {
  const extendScript = `
    if redis.call("GET", KEYS[1]) == ARGV[1] then
      return redis.call("PEXPIRE", KEYS[1], ARGV[2])
    else
      return 0
    end
  `;
  const result = await redis.eval(extendScript, 1, key, value, String(newTtlMs));
  return result === 1;
}

// Usage:
async function processPayment(paymentId: string) {
  const lock = await acquireLock(`payment:${paymentId}`, 30000);

  if (!lock) {
    throw new Error('Could not acquire lock — payment already in progress');
  }

  try {
    await executePaymentLogic(paymentId);
  } finally {
    await lock.release();
  }
}
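The acquire, try, finally, release shape of `processPayment` can be factored into a reusable wrapper. The following is a sketch under assumptions: `withLock` and `onLockLost` are hypothetical names, and `acquire` is any function that resolves to a `Lock` or `null`, such as `() => acquireLock('payment:1001', 30000)`.

```typescript
// Same shape as the Lock interface used in this module.
interface Lock {
  release: () => Promise<boolean>;
  extend: (ttlMs: number) => Promise<boolean>;
}

// Hypothetical helper: runs `work` under the lock and always attempts a release.
async function withLock<T>(
  acquire: () => Promise<Lock | null>,
  work: () => Promise<T>,
  onLockLost: () => void = () => {}
): Promise<T> {
  const lock = await acquire();
  if (!lock) {
    throw new Error('Could not acquire lock');
  }
  try {
    return await work();
  } finally {
    // release() resolving to false means the key no longer held our value:
    // the lock expired (and was possibly re-acquired) while the work was running.
    const released = await lock.release();
    if (!released) onLockLost();
  }
}
```

Reporting a failed release is a design choice worth keeping: it is the cheapest available signal that the critical section ran longer than the TTL and may have overlapped with another holder.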

Lock Extension (Heartbeat Pattern)

If your critical section takes longer than the lock TTL, the lock expires before the work is done — another client acquires it and you have two concurrent holders.

Solutions:

1. Set TTL generously — if work takes up to 5 seconds, set TTL to 30 seconds. Simple but wasteful (a crash holds the lock for up to 30 seconds).

2. Watchdog / heartbeat — a background timer extends the lock while the work is running:

typescript

async function acquireLockWithWatchdog(resource: string, ttlMs: number): Promise<Lock | null> {
  const key = `lock:${resource}`;
  const value = randomUUID();

  const result = await redis.set(key, value, 'NX', 'PX', ttlMs);
  if (result !== 'OK') return null;

  // Watchdog: extend lock every ttlMs/3
  const watchdogInterval = setInterval(async () => {
    try {
      const extended = await extendLock(key, value, ttlMs);
      if (!extended) {
        // Lock was lost (expired and taken by another client)
        clearInterval(watchdogInterval);
        // Signal the work loop to abort
      }
    } catch (err) {
      // Redis call failed; the next tick retries.
      // Without this catch the rejection would be unhandled.
    }
  }, ttlMs / 3);

  return {
    release: async () => {
      clearInterval(watchdogInterval);
      return releaseLock(key, value);
    },
    extend: (newTtl) => extendLock(key, value, newTtl),
  };
}

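The watchdog resets the TTL to ttlMs every ttlMs/3 milliseconds, so the lock stays alive as long as the extensions keep succeeding. It does not help if the whole process stalls for longer than the TTL (the timer stalls with it) or if Redis is unreachable for that long; in both cases the lock expires and another client can acquire it.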
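The "signal the work loop to abort" step is left open in the code above. One possible way to wire it up, sketched here with hypothetical helper names (`startWatchdog`, `runSteps`), is to expose an `AbortSignal` that the work checks between steps. `extend` stands for any function that tries to extend the lock, such as `() => extendLock(key, value, ttlMs)`. This sketch takes the conservative choice of treating a failed Redis call as a lost lock.

```typescript
// Hypothetical helper: extends the lock on a timer and aborts when extension fails.
function startWatchdog(
  extend: () => Promise<boolean>,
  intervalMs: number
): { signal: AbortSignal; stop: () => void } {
  const controller = new AbortController();

  const timer = setInterval(async () => {
    let ok = false;
    try {
      ok = await extend();
    } catch {
      ok = false; // cannot confirm we still hold the lock
    }
    if (!ok) {
      clearInterval(timer);
      controller.abort(); // tells the work loop to stop at its next checkpoint
    }
  }, intervalMs);

  return { signal: controller.signal, stop: () => clearInterval(timer) };
}

// The work loop checks the signal before each step and returns how many steps ran.
async function runSteps(
  steps: Array<() => Promise<void>>,
  signal: AbortSignal
): Promise<number> {
  let completed = 0;
  for (const step of steps) {
    if (signal.aborted) break; // lock lost: stop before touching the resource again
    await step();
    completed++;
  }
  return completed;
}
```

The check narrows the window in which a client works without a lock, but it cannot close it: the lock can still expire between the check and the step that follows it.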

The Fundamental Limitation: Process Pauses

Here is the scenario that breaks even a correctly implemented single-instance lock:

text

Time 0ms: Client A acquires lock (TTL = 30s)
Time 1ms: Client A begins critical section
Time 5000ms: Client A's JVM pauses for GC (stop-the-world collection)
Time 30000ms: Lock expires (the 30s TTL elapses while A is still paused)
Time 30001ms: Client B acquires the lock
Time 30002ms: Client B begins critical section
Time 40000ms: Client A resumes from GC pause — A still thinks it holds the lock
             A and B are now BOTH executing the critical section

This is not a bug in the lock implementation — it is a fundamental property of distributed systems. Any client can pause for an arbitrary duration (GC, OS scheduling, VM migration, network partition). When it resumes, its lock may have expired and been acquired by another client.

The lock does not know the holder paused. The holder does not know the lock expired.
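The same sequence of events can be replayed deterministically with a manual clock. This is a simulation sketch only: `SimulatedLock` is a hypothetical class, no Redis or real timers are involved, and the "pause" is modeled by advancing the clock while none of client A's code runs.

```typescript
// Hypothetical TTL lock driven by a manually advanced clock.
class SimulatedLock {
  now = 0; // simulated time in milliseconds
  private holder: string | null = null;
  private expiresAt = 0;

  tryAcquire(client: string, ttlMs: number): boolean {
    if (this.holder !== null && this.expiresAt > this.now) return false; // still held
    this.holder = client;
    this.expiresAt = this.now + ttlMs;
    return true;
  }

  currentHolder(): string | null {
    return this.holder !== null && this.expiresAt > this.now ? this.holder : null;
  }
}

const sim = new SimulatedLock();
const inCriticalSection = new Set<string>();

// t = 0 ms: A acquires with a 30 s TTL and enters the critical section.
const aAcquired = sim.tryAcquire('A', 30000);
inCriticalSection.add('A');

// t = 5000 ms: A pauses. None of A's code runs, but the clock keeps moving.
sim.now = 5000;
const bTooEarly = sim.tryAcquire('B', 30000); // false: A's lock is still live

// t = 30001 ms: the TTL has elapsed, so B acquires and enters.
sim.now = 30001;
const bAcquired = sim.tryAcquire('B', 30000);
inCriticalSection.add('B');

// t = 40000 ms: A resumes mid-critical-section without re-checking the lock.
sim.now = 40000;
const overlap = inCriticalSection.size;     // 2: both clients are inside
const holderAtResume = sim.currentHolder(); // 'B'
```

Nothing in the lock logic is wrong here. The overlap exists because A has no way to observe the elapsed time while it is paused.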

Fencing Tokens: The Correct Solution

A fencing token is a monotonically increasing integer issued by the lock service when a lock is acquired. The holder passes the fencing token to the resource being protected. The resource rejects any operation with a token lower than the highest it has seen.

text

Client A acquires lock → receives token 42
Client A pauses (GC) → lock expires
Client B acquires lock → receives token 43
Client B writes to database with token 43
Client A resumes → writes to database with token 42
Database: token 42 < max_seen (43) → REJECT Client A's write
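The check on the resource side is small. The sketch below models it in memory; `FencedRegister` is a hypothetical class, and in a real system this comparison would live inside the database or storage service rather than in client code.

```typescript
// Hypothetical resource that enforces fencing tokens on every write.
class FencedRegister {
  private highestToken = 0;
  private value: string | null = null;

  // Rejects any write whose token is lower than the highest token seen so far.
  write(token: number, value: string): boolean {
    if (token < this.highestToken) return false; // stale holder: reject
    this.highestToken = token;
    this.value = value;
    return true;
  }

  read(): string | null {
    return this.value;
  }
}

const register = new FencedRegister();
const bWrite = register.write(43, 'written by B'); // true
const aWrite = register.write(42, 'written by A'); // false: 42 < 43
```

The important property is that the stale write is refused by the resource itself, so correctness no longer depends on the paused client noticing that its lock expired.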

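The SET NX PX lock does not provide fencing tokens: it returns only OK or nil, and no monotonically increasing counter is tied to lock acquisitions. A counter kept with INCR can approximate one, but it is only as durable as the instance's persistence and failover setup, which you would have to verify for your deployment. Systems that need fencing usually obtain tokens from an external sequencer (ZooKeeper, etcd, or a database sequence).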

Practical implication: For most application-level distributed locking (preventing double-processing of a job, preventing concurrent cache recomputes), the residual risk is acceptable. Exclusion only breaks when a pause outlasts the remaining TTL, and a pause of 30+ seconds is unlikely in most JVM or Node.js applications.

For operations where two concurrent executions would cause data corruption with no recovery path (bank transfers, inventory deduction, ledger entries), use database transactions with row-level locking — not Redis locks.

When Single-Instance Locks Are Correct

Use Case	Single-Instance Lock Appropriate?
Cache stampede prevention (lock + recompute)	Yes — double recompute is wasteful but not corrupting
Job queue deduplication	Yes, if jobs are idempotent, so an occasional double run is harmless
Rate limiter coordination	Better handled with INCR+EXPIRE
Inventory reservation (read-then-write)	Risky — consider DB transaction with SELECT FOR UPDATE
Financial deductions (write once)	No — use DB transaction with row lock + idempotency key
Distributed coordination across services	Use etcd or ZooKeeper (have fencing token support)

Summary

The correct lock primitive: SET key value NX PX milliseconds — atomic, single command
Lock value must be a unique UUID — required for safe release (only the holder can release its lock)
Release via Lua script: check UUID matches before DEL — atomic check-and-delete
Lock extension via Lua: PEXPIRE only if UUID matches — prevents extending someone else's lock
Watchdog pattern: background timer extends the lock every TTL/3 to prevent expiry during long operations
Fundamental limitation: process pauses (GC, OS scheduling) can outlast the TTL, so a client may keep working after its lock has expired without knowing; two clients can then simultaneously believe they hold the lock
Fencing tokens are the correct solution: monotonically increasing integers that let the resource reject writes from stale holders; the SET NX PX lock does not issue them
Use Redis locks for best-effort coordination where double-execution is safe; use DB row locks for true exclusive access with no tolerance for concurrent execution

Next: A-2 — Lua Scripting: EVAL, EVALSHA, and Atomic Compound Operations — why Lua scripts execute atomically, the KEYS/ARGV convention, and operations that cannot be implemented safely without Lua.

