Module P-6 · 22 min read

How BullMQ maps job lifecycle to Sorted Sets, Lists, and Hashes. Worker polling, delayed job scheduling, stalled job detection via heartbeat, the rate limiter internals, and choosing BullMQ vs raw Streams.

P-6 — BullMQ Internals: The Redis Data Structures Behind the Job Queue

Q: A BullMQ worker is processing a video transcoding job that takes exactly 45 seconds to complete. The worker is configured with a `lockDuration` of 30,000 milliseconds (30 seconds) and `maxStalledCount` of 1. What will happen during the execution of this job, assuming the worker does not crash?

The job will successfully complete. Although the `lockDuration` is 30 seconds, BullMQ workers automatically run a background interval (by default every `lockDuration / 2`, or 15 seconds) that extends the lock's expiration in Redis. As long as the Node.js event loop isn't entirely blocked, the lock remains valid for the full 45 seconds. — BullMQ uses a lease mechanism for job locks. The `lockDuration` is simply the TTL set on the lock key in Redis. To accommodate long-running jobs without requiring massive initial TTLs, the worker spawns a background timer (`lockRenewTime`) that periodically pings Redis to extend the lock. This ensures that if the worker *actually* crashes, the lock expires relatively quickly (within 30 seconds), but if the worker is healthy and just working slowly, the lock is continuously refreshed until the job completes.

Q: An operations engineer wants to know exactly how many jobs are currently waiting to be processed in the `emails` queue without writing a Node.js script. Which raw Redis command provides this exact number in O(1) time?

`LLEN bull:emails:wait` — In BullMQ's internal architecture, the "waiting" and "active" states are managed using Redis Lists to provide strict FIFO ordering. The `LLEN` command returns the length of a list in O(1) time, making it the perfect, zero-overhead way to monitor queue depth directly from the Redis CLI. (`ZCARD` would be used for delayed, completed, or failed jobs, which are stored in Sorted Sets).

Q: A team deploys a high-throughput BullMQ queue processing 10,000 jobs per minute. After three days, they receive an alert that Redis memory usage has spiked by several gigabytes, eventually triggering an OOM kill. The queue is fully processed (`wait` and `active` lists are empty). What is the most likely architectural misconfiguration?

The application forgot to define `removeOnComplete` and `removeOnFail` in the Worker options. Consequently, BullMQ left the Hash data structure (`bull:emails:{jobId}`) and the Sorted Set entries for all 43 million completed jobs permanently in Redis memory. — By default, BullMQ prioritizes observability over memory efficiency. When a job succeeds or fails, it is moved to the `completed` or `failed` Sorted Sets, and its payload (the job Hash) remains untouched. For high-throughput queues, this unbounded growth will inevitably crash Redis. Production deployments must explicitly configure `removeOnComplete: { count: N }` and `removeOnFail: { count: M }` (or age-based equivalents) to force BullMQ to prune historical data and cap memory usage.

Who this module is for: You use BullMQ (or Bull) for job queues and have run into issues — jobs that get stuck, queues that slow down under load, stalled job detection that is too aggressive or not aggressive enough. This module explains the Redis data structures BullMQ uses for every queue state, so you can reason about its behaviour, tune it correctly, and debug it at the Redis level.

Why Understanding BullMQ Internals Matters

BullMQ is a job queue built on Redis. Most engineers treat it as a black box: they add jobs with queue.add() and process them in a processor function passed to new Worker(). But when queues misbehave (jobs stay in "active" forever, delayed jobs fire late, rate limits fail), you cannot diagnose or fix the problem without understanding the Redis layer.

Every BullMQ behaviour maps to specific Redis operations. Knowing this lets you:

Query queue state directly with redis-cli without going through BullMQ's API
Understand why a job is "stuck" and fix it
Tune TTL, stall checks, and rate limiter settings appropriately
Identify Redis memory usage caused by large queues

The Key Schema

BullMQ uses a namespaced key prefix. For a queue named emails:

text

bull:emails:id              → String: auto-incrementing job ID counter
bull:emails:wait            → List: jobs waiting to be picked up (FIFO)
bull:emails:active          → List: jobs currently being processed
bull:emails:completed       → Sorted Set: completed jobs (score = completion timestamp)
bull:emails:failed          → Sorted Set: failed jobs (score = failure timestamp)
bull:emails:delayed         → Sorted Set: delayed jobs (score derived from the run-at timestamp)
bull:emails:prioritized     → Sorted Set: priority jobs (score combines the priority with an insertion counter, so equal priorities stay FIFO)
bull:emails:paused          → List: while the queue is paused, jobs go here instead of wait
bull:emails:meta            → Hash: queue metadata (paused, maxLen, etc.)
bull:emails:{jobId}         → Hash: job data (id, data, opts, timestamp, etc.)
bull:emails:{jobId}:lock    → String: token of the worker that holds the job, with a TTL of lockDuration
bull:emails:events          → Stream: BullMQ events (completed, failed, stalled, etc.)
bull:emails:limiter         → String: rate limiter counter with a TTL of the limiter duration
bull:emails:stalled-check   → String: short-lived key that throttles how often the stall check runs
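As a small illustration of the naming scheme, the helper below builds key names for a queue. The function name queueKeys and its prefix parameter are hypothetical conveniences for this sketch, not BullMQ APIs, and only a subset of the keys is covered.

```javascript
// Hypothetical helper: builds Redis key names that follow the schema above.
function queueKeys(queueName, prefix = 'bull') {
  const base = `${prefix}:${queueName}`;
  const names = ['id', 'wait', 'active', 'completed', 'failed',
    'delayed', 'prioritized', 'paused', 'meta', 'events'];
  const keys = {};
  for (const name of names) keys[name] = `${base}:${name}`;
  // Per-job keys: the job Hash and its lock.
  keys.job = (jobId) => `${base}:${jobId}`;
  keys.lock = (jobId) => `${base}:${jobId}:lock`;
  return keys;
}

const keys = queueKeys('emails');
console.log(keys.wait);     // bull:emails:wait
console.log(keys.job(42));  // bull:emails:42
console.log(keys.lock(42)); // bull:emails:42:lock
```

A helper like this is handy when scripting redis-cli checks for several queues at once.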

Job Lifecycle in Redis

Adding a Job (queue.add)

javascript

await emailQueue.add('send-welcome', { userId: '1001', email: 'j@example.com' });

What happens in Redis:

1. INCR bull:emails:id → generates job ID, e.g., 42
2. HSET bull:emails:42 with all job fields:
- id: "42"
- name: "send-welcome"
- data: '{"userId":"1001","email":"j@example.com"}'
- opts: '{"attempts":1,"delay":0,...}'
- timestamp: "1717000000000"
- delay: "0"
- priority: "0"
3. LPUSH bull:emails:wait 42 → add job ID to the head of the wait list (workers take from the tail, which gives FIFO order; the lifo option pushes to the tail instead)
4. XADD bull:emails:events * event added jobId 42 → emit event to the events stream

The job data (step 2) is stored in a Hash for O(1) field access. The queue lists and sorted sets store only the job ID — the actual data is always in the Hash.
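The split between "ID in the list, data in the Hash" can be modelled in a few lines. This is an in-memory sketch, not a Redis client: the store object and addJob function are hypothetical stand-ins for the Redis keys and for what queue.add triggers.

```javascript
// In-memory stand-ins for the Redis structures (not a Redis client).
const store = {
  idCounter: 0,       // bull:emails:id
  hashes: new Map(),  // bull:emails:{jobId}
  wait: [],           // bull:emails:wait
  events: [],         // bull:emails:events
};

function addJob(name, data, now = Date.now()) {
  const id = String(++store.idCounter);             // INCR
  store.hashes.set(id, {                            // HSET
    name,
    data: JSON.stringify(data),
    timestamp: String(now),
    delay: '0',
    priority: '0',
  });
  store.wait.push(id);                              // the list receives only the ID
  store.events.push({ event: 'added', jobId: id }); // XADD
  return id;
}

const id = addJob('send-welcome', { userId: '1001', email: 'j@example.com' });
console.log(store.wait);                // [ '1' ]
console.log(store.hashes.get(id).name); // send-welcome
```

Because the list holds only IDs, moving a job between states never copies its payload.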

Adding a Delayed Job

javascript

await emailQueue.add('send-followup', { userId: '1001' }, { delay: 3600000 }); // 1 hour

Instead of RPUSH bull:emails:wait, BullMQ uses:

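ZADD bull:emails:delayed {score} {jobId}

The score is derived from the run-at timestamp. Current BullMQ versions store the millisecond timestamp multiplied by 4096 (0x1000), which leaves the low bits free to keep jobs with the same timestamp in insertion order.

Promotion of due jobs is built into BullMQ workers (BullMQ 1.x needed a separate QueueScheduler instance for this). The promotion script reads the delayed sorted set with:

ZRANGEBYSCORE bull:emails:delayed 0 {max_score_for_now} LIMIT 0 100

When jobs become ready (their run-at time ≤ current timestamp), the script moves them to bull:emails:wait via LPUSH and ZREM.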
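The promotion step can be sketched with plain arrays. This is an illustrative in-memory model with hypothetical function names (addDelayed, promoteDue); scores are plain millisecond timestamps for readability, which simplifies whatever encoding the real sorted set uses.

```javascript
// Delayed set modelled as an array of { score, jobId }, kept sorted by score.
const delayed = [];
const wait = [];

function addDelayed(jobId, delayMs, now) {
  delayed.push({ score: now + delayMs, jobId });  // ZADD
  delayed.sort((a, b) => a.score - b.score);
}

function promoteDue(now, limit = 100) {
  // Equivalent of a score range query up to `now`, capped at `limit`.
  const due = delayed.filter((e) => e.score <= now).slice(0, limit);
  for (const entry of due) {
    delayed.splice(delayed.indexOf(entry), 1);    // ZREM
    wait.unshift(entry.jobId);                    // LPUSH
  }
  return due.length;
}

addDelayed('7', 3600000, 0); // due in one hour
addDelayed('8', 1000, 0);    // due in one second
console.log(promoteDue(500));     // 0
console.log(promoteDue(1000));    // 1
console.log(wait);                // [ '8' ]
console.log(promoteDue(3600000)); // 1
```

If promotion runs late, every job whose score has already passed is moved in one batch, which is why delayed jobs can fire late but never early.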

Adding a Priority Job

javascript

await emailQueue.add('vip-email', { userId: '99' }, { priority: 1 }); // lower = higher priority

ZADD bull:emails:prioritized {priority_score} {jobId}

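Workers take jobs from wait first, so jobs without a priority run before prioritized ones. When wait is empty, the worker pops the member with the lowest score from prioritized.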
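To see how a single numeric score can order jobs by priority and still keep insertion order within one priority, consider the sketch below. The formula (priority times 2^32 plus an insertion counter) is an assumption made for illustration, and priorityScore is a hypothetical name.

```javascript
// Assumption for this sketch: score = priority * 2^32 + insertion counter.
let counter = 0;
function priorityScore(priority) {
  counter += 1;
  return priority * 2 ** 32 + counter;
}

const prioritized = [
  { jobId: 'a', score: priorityScore(5) },
  { jobId: 'b', score: priorityScore(1) },
  { jobId: 'c', score: priorityScore(1) },
];

// A sorted set hands out members by ascending score (ZPOPMIN order).
const order = prioritized
  .sort((x, y) => x.score - y.score)
  .map((e) => e.jobId);
console.log(order); // [ 'b', 'c', 'a' ]
```

Jobs b and c share priority 1, so the counter decides between them; job a has a larger priority number and therefore runs last.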

Processing a Job (worker picks up)

The worker calls:

LMOVE bull:emails:wait bull:emails:active RIGHT LEFT

This atomically moves the job ID from the tail of wait to the head of active. If no jobs are waiting, the worker calls:

BLMOVE bull:emails:wait bull:emails:active RIGHT LEFT 5

This blocks for up to 5 seconds. When a job arrives, the BLMOVE completes and the job ID is in active.

The worker then reads the job data:

HGETALL bull:emails:{jobId}

And acquires a "lock" on the job:

SET bull:emails:{jobId}:lock {worker_token} PX 30000 NX

This lock prevents another worker from claiming the same job. The lock expires in 30 seconds (configurable with lockDuration).
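The claim step can be modelled in memory to make the list ends and the lock semantics concrete. This is a sketch with hypothetical names (claimNext, acquireLock), not BullMQ code; index 0 stands for the head of each list and time is passed in explicitly.

```javascript
const wait = ['3', '2', '1']; // '1' was queued first and sits at the tail
const active = [];
const locks = new Map();      // lock key -> { token, expiresAt }

// Models SET key token PX lockDuration NX.
function acquireLock(jobId, token, lockDuration, now) {
  const key = `bull:emails:${jobId}:lock`;
  const current = locks.get(key);
  if (current && current.expiresAt > now) return false; // NX fails: lock is held
  locks.set(key, { token, expiresAt: now + lockDuration });
  return true;
}

// Models LMOVE wait active RIGHT LEFT followed by taking the lock.
function claimNext(token, lockDuration, now) {
  if (wait.length === 0) return null;
  const jobId = wait.pop();   // take from the tail (RIGHT)
  active.unshift(jobId);      // insert at the head (LEFT)
  acquireLock(jobId, token, lockDuration, now);
  return jobId;
}

console.log(claimNext('worker-A', 30000, 0));        // 1
console.log(claimNext('worker-B', 30000, 0));        // 2
console.log(active);                                 // [ '2', '1' ]
console.log(acquireLock('1', 'worker-B', 30000, 10000)); // false
console.log(acquireLock('1', 'worker-B', 30000, 30000)); // true
```

The last two calls show the lease behaviour: a second worker cannot take the lock while it is valid, but can once the TTL has passed without a renewal.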

Job Completion

javascript

// Worker signals success
await job.moveToCompleted('email sent', workerToken);

BullMQ executes a Lua script that atomically:

Verifies the worker still holds the lock (GET bull:emails:{jobId}:lock)
LREM bull:emails:active 0 {jobId} — removes from active list
ZADD bull:emails:completed {timestamp} {jobId} — adds to completed set
Optionally trims completed set if removeOnComplete is configured
DEL bull:emails:{jobId}:lock — releases the lock
XADD bull:emails:events * event completed jobId {jobId} — emits event
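The steps above can be sketched as one function over an in-memory state object. It is a simplified model with hypothetical names (state, moveToCompleted with a keepCount argument), not the actual Lua script; a single synchronous JavaScript function stands in for the atomicity that Lua gives inside Redis.

```javascript
function moveToCompleted(state, jobId, token, now, keepCount) {
  const lock = state.locks.get(jobId);
  if (!lock || lock.token !== token) {
    throw new Error(`Missing lock for job ${jobId}`);       // lock check
  }
  state.active = state.active.filter((id) => id !== jobId); // LREM
  state.completed.push({ score: now, jobId });              // ZADD
  if (keepCount !== undefined) {
    // Keep only the newest `keepCount` entries and drop their job Hashes.
    state.completed.sort((a, b) => a.score - b.score);
    const excess = Math.max(0, state.completed.length - keepCount);
    for (const old of state.completed.splice(0, excess)) {
      state.hashes.delete(old.jobId);
    }
  }
  state.locks.delete(jobId);                                // DEL lock
  state.events.push({ event: 'completed', jobId });         // XADD
}

const state = {
  active: ['42'],
  completed: [{ score: 1000, jobId: '41' }],
  hashes: new Map([['41', {}], ['42', {}]]),
  locks: new Map([['42', { token: 'tok-A' }]]),
  events: [],
};
moveToCompleted(state, '42', 'tok-A', 2000, 1);
console.log(state.completed.map((e) => e.jobId)); // [ '42' ]
console.log(state.hashes.has('41'));              // false
```

A worker whose lock has expired and been taken over fails the first check, so a job that was handed to another worker cannot be completed twice.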

Job Failure

javascript

// Worker signals failure; the script decides between retry and final failure
await job.moveToFailed(error, workerToken);

Similar Lua script:

Verify lock
LREM bull:emails:active 0 {jobId}
If retries remain: RPUSH bull:emails:wait {jobId} (or with backoff delay: ZADD bull:emails:delayed ...)
If no retries remain: ZADD bull:emails:failed {timestamp} {jobId}
Update job Hash with failedReason, stacktrace, attemptsMade
Release lock, emit event
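The retry decision in the middle of that script can be written as a small pure function. This is a sketch under assumptions: nextStateAfterFailure and the failedAt field are hypothetical names, attempts is the configured maximum, and attemptsMade counts attempts finished before this failure.

```javascript
// Decides where a job goes after a failed attempt.
function nextStateAfterFailure(job, backoffMs = 0) {
  const attemptsMade = job.attemptsMade + 1;
  if (attemptsMade < job.attempts) {
    return backoffMs > 0
      ? { attemptsMade, target: 'delayed', runAt: job.failedAt + backoffMs }
      : { attemptsMade, target: 'wait' };
  }
  return { attemptsMade, target: 'failed' };
}

const job = { attempts: 3, attemptsMade: 0, failedAt: 1000 };
console.log(nextStateAfterFailure(job).target);                        // wait
console.log(nextStateAfterFailure(job, 5000));                         // { attemptsMade: 1, target: 'delayed', runAt: 6000 }
console.log(nextStateAfterFailure({ ...job, attemptsMade: 2 }).target); // failed
```

With attempts: 3 the job runs at most three times: two failures lead to a retry and the third moves it to failed.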

Stalled Job Detection

A job becomes "stalled" when the worker crashes (SIGKILL, OOM) after moving the job to active but before completing or failing it. The lock expires, but the job ID is still in active and no worker owns it. Without a stall check it would stay there indefinitely.

The stall check runs periodically (configurable with stalledInterval, default 30 seconds):

javascript

// Worker's internal stall check (runs inside the Worker; BullMQ 1.x ran it in QueueScheduler)
// Checks all jobs in 'active' whose lock has expired

The Lua-based stall check:

Scans bull:emails:active for job IDs
For each: checks if bull:emails:{jobId}:lock exists
If the lock does not exist (expired): the job is stalled and its stall counter is incremented
If the stall counter is still within maxStalledCount: moves back to wait (retry)
If the stall counter exceeds maxStalledCount: moves to failed
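A minimal in-memory sketch of such a check follows. It assumes the retry decision is driven by a per-job stall counter compared against the maxStalledCount option; checkStalled and the state shape are hypothetical, and the real script does more bookkeeping than this.

```javascript
function checkStalled(state, now, maxStalledCount = 1) {
  const result = { recovered: [], failed: [] };
  for (const jobId of [...state.active]) {
    const lock = state.locks.get(jobId);
    if (lock && lock.expiresAt > now) continue; // lock still held: job is healthy
    state.locks.delete(jobId);
    state.active = state.active.filter((id) => id !== jobId);
    const count = (state.stalledCounts.get(jobId) || 0) + 1;
    state.stalledCounts.set(jobId, count);
    if (count > maxStalledCount) {
      state.failed.push(jobId);
      result.failed.push(jobId);
    } else {
      state.wait.push(jobId);                   // back to wait for another worker
      result.recovered.push(jobId);
    }
  }
  return result;
}

const state = {
  active: ['1', '2'],
  wait: [],
  failed: [],
  locks: new Map([['1', { expiresAt: 50000 }], ['2', { expiresAt: 10000 }]]),
  stalledCounts: new Map(),
};
const first = checkStalled(state, 30000);
console.log(first);  // { recovered: [ '2' ], failed: [] }

// Job 2 is picked up again and its worker dies before taking a lock.
state.active.unshift(state.wait.pop());
const second = checkStalled(state, 40000);
console.log(second); // { recovered: [], failed: [ '2' ] }
```

Job 1 is never touched because its lock is valid at both check times, which is exactly what lock renewal guarantees for a healthy worker.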

javascript

// Configure stall detection
const worker = new Worker('emails', processor, {
  stalledInterval: 30000,  // check every 30 seconds
  maxStalledCount: 1,      // recover a stalled job once; fail it on the second stall
  lockDuration: 30000,     // lock expires in 30 seconds
  lockRenewTime: 15000,    // renew lock every 15 seconds
});

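Tuning stall detection:

lockDuration does not need to cover the whole job, because a healthy worker renews the lock in the background. It must exceed the longest stretch during which the worker cannot renew, typically synchronous CPU-bound code that blocks the Node.js event loop
lockRenewTime defaults to lockDuration / 2, so the worker renews its lock halfway through the duration
If a job can block the event loop for up to 5 minutes: set lockDuration: 360000 (6 minutes), or move the blocking work off the main thread
maxStalledCount: 0 means a job is moved to failed the first time it stalls; higher values allow that many recoveries before the job fails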
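The interaction between lockDuration, lockRenewTime and a blocked event loop can be checked with a small simulation. This is a sketch with a simulated clock; simulateLock and its blockedUntil parameter are hypothetical and only model the timing, not BullMQ itself.

```javascript
// Returns the time (ms) at which the lock expired, or null if it was held
// for the whole job. `blockedUntil` models a blocked event loop: renewals
// scheduled before that time cannot run until it is reached.
function simulateLock({ jobMs, lockDuration, lockRenewTime = lockDuration / 2, blockedUntil = 0 }) {
  let expiresAt = lockDuration;                // lock taken at t = 0
  for (let t = lockRenewTime; t < jobMs; t += lockRenewTime) {
    const runsAt = Math.max(t, blockedUntil);  // a blocked timer fires late
    if (runsAt >= jobMs) break;                // job finished before the renewal
    if (runsAt >= expiresAt) return expiresAt; // too late: lock already expired
    expiresAt = runsAt + lockDuration;         // renewal extends the lease
  }
  return expiresAt < jobMs ? expiresAt : null;
}

// 45 s job, 30 s lock, renewals every 15 s: the lock never expires.
console.log(simulateLock({ jobMs: 45000, lockDuration: 30000 })); // null

// Same job, but the event loop is blocked for the first 40 s.
console.log(simulateLock({ jobMs: 45000, lockDuration: 30000, blockedUntil: 40000 })); // 30000
```

The second case is the one that produces false stalls in practice: the worker is alive, but it could not renew in time, so the stall checker treats the job as abandoned.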

Rate Limiter Internals

javascript

const worker = new Worker('emails', processor, {
  limiter: {
    max: 100,
    duration: 1000,  // 100 jobs per second
  },
});

BullMQ's rate limiter is a fixed-window counter stored in a String key with a TTL:

bull:emails:limiter → String: number of jobs started in the current window, expiring after duration ms

When a worker fetches a job, the script:

Reads the counter from the rate limiter key
If count ≥ max: does not start the job; it stays in wait and the worker backs off until the key's TTL runs out (earlier BullMQ releases moved the job to delayed instead)
Otherwise: increments the counter (INCR) and proceeds
If this was the first job of the window: sets the key to expire after duration ms (PEXPIRE), which starts a new window
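A max-per-duration limit can be modelled with nothing more than a counter and an expiry, the in-memory equivalent of INCR plus PEXPIRE on one key. The class below is a simplified illustration with a hypothetical name (WindowLimiter), not BullMQ's script. Note that this is the fixed-window variant; a sliding window would track individual timestamps instead.

```javascript
class WindowLimiter {
  constructor(max, duration) {
    this.max = max;
    this.duration = duration;
    this.count = 0;      // value of the counter key
    this.expiresAt = 0;  // when the counter key expires
  }

  // Returns 0 if a job may start now, otherwise the ms until the window resets.
  tryStart(now) {
    if (now >= this.expiresAt) this.count = 0;  // key has expired
    if (this.count >= this.max) return this.expiresAt - now;
    this.count += 1;                            // INCR
    if (this.count === 1) {
      this.expiresAt = now + this.duration;     // PEXPIRE on the first job
    }
    return 0;
  }
}

const limiter = new WindowLimiter(2, 1000);
console.log(limiter.tryStart(0));    // 0
console.log(limiter.tryStart(10));   // 0
console.log(limiter.tryStart(20));   // 980
console.log(limiter.tryStart(1000)); // 0
```

The non-zero return value is the natural back-off time: nothing can start before the window resets, so polling earlier is wasted work.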

Querying Queue State Directly

With this knowledge, you can inspect BullMQ queues using raw Redis commands:

bash

# How many jobs are waiting?
redis-cli LLEN bull:emails:wait

# How many jobs are active?
redis-cli LLEN bull:emails:active

# What jobs are active? (get their IDs)
redis-cli LRANGE bull:emails:active 0 -1

# Get details of a specific job
redis-cli HGETALL bull:emails:42

# What delayed jobs are coming up in the next 60 seconds?
# (BullMQ stores delayed scores as the ms timestamp × 4096; date +%s%3N needs GNU date)
redis-cli ZRANGEBYSCORE bull:emails:delayed 0 $(( ($(date +%s%3N) + 60000) * 4096 )) WITHSCORES

# How many failed jobs?
redis-cli ZCARD bull:emails:failed

# View the events stream
redis-cli XREVRANGE bull:emails:events + - COUNT 10

Memory Considerations

For high-throughput queues, BullMQ keys accumulate:

Completed jobs: bull:emails:{jobId} Hashes persist after completion unless removeOnComplete is set
Failed jobs: Same — persist forever unless removeOnFail

javascript

// Recommended: auto-remove jobs after a count or age
const worker = new Worker('emails', processor, {
  removeOnComplete: { count: 1000 },   // keep last 1000 completed
  removeOnFail: { count: 500 },        // keep last 500 failed
});

Without this, a queue processing 1,000 jobs/hour generates 24,000 job Hashes per day. Each Hash is ~300–500 bytes. At 1M jobs total: ~300–500MB just for the job Hashes.

The completed and failed Sorted Sets also grow unboundedly. removeOnComplete.count limits the Sorted Set size by trimming (ZREMRANGEBYRANK) after each completion.
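The arithmetic is easy to reproduce for your own throughput. The function below is a rough estimator with a hypothetical name (estimateHashMemory); the bytes-per-job figure is an input you would measure yourself, and Sorted Set overhead is ignored.

```javascript
// Rough estimate of memory held by job Hashes when nothing is removed.
function estimateHashMemory(jobsPerHour, days, bytesPerJob) {
  const jobs = jobsPerHour * 24 * days;
  return { jobs, megabytes: (jobs * bytesPerJob) / 1e6 };
}

// 1,000 jobs/hour for one day at ~400 bytes per job.
console.log(estimateHashMemory(1000, 1, 400));   // { jobs: 24000, megabytes: 9.6 }

// 10,000 jobs/minute (600,000/hour) for three days.
console.log(estimateHashMemory(600000, 3, 400)); // { jobs: 43200000, megabytes: 17280 }
```

With removeOnComplete: { count: 1000 } the job count in this formula is capped at 1,000 regardless of throughput, which is why the setting matters more as the queue gets busier.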

Summary

BullMQ uses wait (List) for FIFO queuing, active (List) for in-progress jobs, completed/failed (Sorted Sets) for history, delayed (Sorted Set with timestamp score) for scheduling
Job data lives in a Hash bull:{queue}:{jobId}; queues store only the ID
Workers use LMOVE wait active (atomic) to claim jobs; a per-job lock key with a TTL prevents double-processing
Stalled jobs (lock expired, still in active) are detected and retried or failed by the stall checker
lockRenewTime defaults to half lockDuration; set lockDuration to exceed the longest time a worker may be unable to renew (for example while the event loop is blocked)
Rate limiting uses a counter key with a TTL; when the window is full, jobs wait until the key expires
Enable removeOnComplete and removeOnFail to prevent unbounded memory growth
Query queue state directly with Redis commands for debugging without the BullMQ API overhead

Next: P-7 — Cache Stampede, Avalanche, and Penetration — three cache failure modes that look similar in monitoring but require different solutions.

