Module P-6 · 22 min read

How BullMQ maps job lifecycle to Sorted Sets, Lists, and Hashes. Worker polling, delayed job scheduling, stalled job detection via heartbeat, the rate limiter internals, and choosing BullMQ vs raw Streams.

P-6 — BullMQ Internals: The Redis Data Structures Behind the Job Queue

Who this module is for: You use BullMQ (or Bull) for job queues and have run into issues — jobs that get stuck, queues that slow down under load, stalled job detection that is too aggressive or not aggressive enough. This module explains the Redis data structures BullMQ uses for every queue state, so you can reason about its behaviour, tune it correctly, and debug it at the Redis level.


Why Understanding BullMQ Internals Matters

BullMQ is a job queue built on Redis. Most engineers treat it as a black box: they add jobs with queue.add() and process them in the processor function passed to new Worker(). But when queues misbehave (jobs stay in "active" forever, delayed jobs fire late, rate limits fail), you cannot diagnose or fix the problem without understanding the Redis layer.

Every BullMQ behaviour maps to specific Redis operations. Knowing this lets you:

  • Query queue state directly with redis-cli without going through BullMQ's API
  • Understand why a job is "stuck" and fix it
  • Tune TTL, stall checks, and rate limiter settings appropriately
  • Identify Redis memory usage caused by large queues

The Key Schema

BullMQ uses a namespaced key prefix. For a queue named emails:

bull:emails:id              → String: auto-incrementing job ID counter
bull:emails:wait            → List: jobs waiting to be picked up (FIFO)
bull:emails:active          → List: jobs currently being processed
bull:emails:completed       → Sorted Set: completed jobs (score = completion timestamp)
bull:emails:failed          → Sorted Set: failed jobs (score = failure timestamp)
bull:emails:delayed         → Sorted Set: delayed jobs (score = run-at timestamp)
bull:emails:prioritized     → Sorted Set: priority jobs (score combines priority with an insertion counter; lowest score first)
bull:emails:paused          → List: queue is paused, jobs go here instead of wait
bull:emails:meta            → Hash: queue metadata (paused, maxLen, etc.)
bull:emails:{jobId}         → Hash: job data (id, data, opts, timestamp, etc.)
bull:emails:events          → Stream: BullMQ events (completed, failed, stalled, etc.)
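bull:emails:limiter         → String: rate limiter counter for the current window (expires with the window)
bull:emails:stalled         → Set: job IDs marked as possibly stalled between checks
bull:emails:stalled-check   → String: short-lived guard key that spaces out stall checks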
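
To keep redis-cli sessions and debugging scripts consistent, the key names can be generated rather than typed by hand. The following is a minimal sketch that assumes the default bull prefix; queueKeys is a hypothetical helper, not part of BullMQ's API, and it covers only the core keys from the list above.

```javascript
// Hypothetical helper (not a BullMQ API): builds the core Redis key names
// for one queue, assuming the default "bull" prefix.
function queueKeys(queueName, prefix = 'bull') {
  const base = `${prefix}:${queueName}`;
  return {
    id: `${base}:id`,
    wait: `${base}:wait`,
    active: `${base}:active`,
    completed: `${base}:completed`,
    failed: `${base}:failed`,
    delayed: `${base}:delayed`,
    meta: `${base}:meta`,
    events: `${base}:events`,
    // Per-job Hash and its lock key.
    job: (jobId) => `${base}:${jobId}`,
    lock: (jobId) => `${base}:${jobId}:lock`,
  };
}

const keys = queueKeys('emails');
console.log(keys.wait);    // bull:emails:wait
console.log(keys.job(42)); // bull:emails:42
```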

Job Lifecycle in Redis

Adding a Job (queue.add)

javascript
await emailQueue.add('send-welcome', { userId: '1001', email: 'j@example.com' });

What happens in Redis:

  1. INCR bull:emails:id → generates job ID, e.g., 42
  2. HSET bull:emails:42 with all job fields:
    • id: "42"
    • name: "send-welcome"
    • data: '{"userId":"1001","email":"j@example.com"}'
    • opts: '{"attempts":1,"delay":0,...}'
    • timestamp: "1717000000000"
    • delay: "0"
    • priority: "0"
  3. LPUSH bull:emails:wait 42 → push the job ID onto the head of the wait list (workers pop from the tail, which gives FIFO order)
  4. XADD bull:emails:events * event added jobId 42 → emit event to the events stream

The job data (step 2) is stored in a Hash for O(1) field access. The queue lists and sorted sets store only the job ID — the actual data is always in the Hash.
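
The write path can be made concrete by listing the commands as data. This sketch assumes the job ID has already been obtained from INCR, and it returns the remaining commands as argument arrays that you could type into redis-cli. addJobCommands is a hypothetical helper; BullMQ itself performs the equivalent work inside a single Lua script, and the opts field is omitted here for brevity. LPUSH is used so that a pop from the tail of the list yields FIFO order.

```javascript
// Sketch only: returns the Redis commands for adding a job as argument arrays.
// Assumes jobId came from a prior INCR on bull:<queue>:id.
function addJobCommands(queue, jobId, name, data, now = Date.now()) {
  const base = `bull:${queue}`;
  const id = String(jobId);
  return [
    // The job Hash holds the payload; lists and sorted sets hold only the ID.
    ['HSET', `${base}:${id}`,
      'id', id,
      'name', name,
      'data', JSON.stringify(data),
      'timestamp', String(now),
      'delay', '0',
      'priority', '0'],
    // Head of the wait list; workers take from the tail.
    ['LPUSH', `${base}:wait`, id],
    ['XADD', `${base}:events`, '*', 'event', 'added', 'jobId', id],
  ];
}

const commands = addJobCommands('emails', 42, 'send-welcome', { userId: '1001' }, 1717000000000);
for (const command of commands) console.log(command.join(' '));
```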

Adding a Delayed Job

javascript
await emailQueue.add('send-followup', { userId: '1001' }, { delay: 3600000 }); // 1 hour

Instead of pushing the job ID onto bull:emails:wait, BullMQ uses:

ZADD bull:emails:delayed {runAt_timestamp_ms} {jobId}

A scheduler polls the delayed sorted set. In BullMQ 1.x this was a separate QueueScheduler instance; from BullMQ 2.0 onward the logic is built into the workers. Conceptually the query is:

ZRANGEBYSCORE bull:emails:delayed 0 {now_ms} LIMIT 0 100

When jobs become ready (their score ≤ current timestamp), the scheduler moves them to bull:emails:wait via LPUSH and ZREM.
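
The promotion step is easier to follow with a small in-memory model. The sketch below stands in for the Sorted Set with an array of { id, runAt } entries and for the wait List with a plain array; promoteDueJobs is a hypothetical name, and the real logic runs inside Redis rather than in application code.

```javascript
// In-memory model of the delayed Sorted Set: entries are { id, runAt }.
// Mirrors "range by score 0..now", then ZREM + LPUSH for each due job.
function promoteDueJobs(delayed, wait, now, limit = 100) {
  const due = delayed
    .filter((entry) => entry.runAt <= now)
    .sort((a, b) => a.runAt - b.runAt)
    .slice(0, limit);
  for (const entry of due) {
    delayed.splice(delayed.indexOf(entry), 1); // ZREM
    wait.unshift(entry.id);                    // LPUSH (head of the list)
  }
  return due.map((entry) => entry.id);
}

const delayed = [
  { id: '7', runAt: 2000 },
  { id: '5', runAt: 1000 },
  { id: '9', runAt: 9000 },
];
const wait = [];
const promoted = promoteDueJobs(delayed, wait, 2500);
console.log(promoted); // [ '5', '7' ]
console.log(wait);     // [ '7', '5' ]
```

Because workers take from the tail of wait, job 5 (due first) is processed before job 7.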

Adding a Priority Job

javascript
await emailQueue.add('vip-email', { userId: '99' }, { priority: 1 }); // lower = higher priority

In Redis, the job ID goes into the prioritized sorted set instead of the wait list:

ZADD bull:emails:prioritized {priority_score} {jobId}

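In current BullMQ releases, workers take from wait first and fall back to prioritized only when wait is empty, so jobs added without a priority run before prioritized ones. Within prioritized, the entry with the lowest score is taken first.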

Processing a Job (worker picks up)

The worker calls:

LMOVE bull:emails:wait bull:emails:active RIGHT LEFT

This atomically moves the job ID from the tail of wait to the head of active. If no jobs are waiting, the worker calls:

BLMOVE bull:emails:wait bull:emails:active RIGHT LEFT 5

This blocks for up to 5 seconds. When a job arrives, BLMOVE returns and the job ID is already in active.

The worker then reads the job data:

HGETALL bull:emails:{jobId}

And acquires a "lock" on the job:

SET bull:emails:{jobId}:lock {worker_token} PX 30000 NX

This lock prevents another worker from claiming the same job. The lock expires in 30 seconds (configurable with lockDuration).
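
The claim sequence can be sketched as a single function. This assumes redis is an ioredis-style client exposing lmove(), hgetall(), and set(); claimNextJob is a hypothetical helper, and unlike BullMQ it issues the steps as separate commands, so a crash between LMOVE and SET would leave an unlocked job in active.

```javascript
// Sketch: `redis` is assumed to be an ioredis-style client.
// Not BullMQ's own code path; shown only to make the steps concrete.
async function claimNextJob(redis, queue, workerToken, lockMs = 30000) {
  const base = `bull:${queue}`;
  // Tail of wait -> head of active, in one atomic command.
  const jobId = await redis.lmove(`${base}:wait`, `${base}:active`, 'RIGHT', 'LEFT');
  if (jobId === null) return null; // nothing waiting

  const data = await redis.hgetall(`${base}:${jobId}`);
  // NX: set only if no lock exists; PX: expire after lockMs milliseconds.
  const reply = await redis.set(`${base}:${jobId}:lock`, workerToken, 'PX', lockMs, 'NX');
  return { jobId, data, locked: reply === 'OK' };
}
```

A caller would treat a null result as "queue empty" and locked: false as "another worker already holds this job".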

Job Completion

javascript
// Worker signals success
await job.moveToCompleted('email sent', workerToken);

BullMQ executes a Lua script that atomically:

  1. Verifies the worker still holds the lock (GET bull:emails:{jobId}:lock)
  2. LREM bull:emails:active 0 {jobId} — removes from active list
  3. ZADD bull:emails:completed {timestamp} {jobId} — adds to completed set
  4. Optionally trims completed set if removeOnComplete is configured
  5. DEL bull:emails:{jobId}:lock — releases the lock
  6. XADD bull:emails:events * event completed jobId {jobId} — emits event
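
The steps above can be sketched as a small Lua script driven from JavaScript. This is an illustration written for this module, not BullMQ's actual script: it omits the trimming step, assumes an ioredis-style client whose eval(script, numKeys, ...keysAndArgs) method is available, and uses the hypothetical names COMPLETE_JOB_LUA and completeJob.

```javascript
// Illustrative Lua: verify the lock, then move the job from active to completed.
const COMPLETE_JOB_LUA = `
local lockKey, activeKey, completedKey, eventsKey = KEYS[1], KEYS[2], KEYS[3], KEYS[4]
local jobId, token, now = ARGV[1], ARGV[2], ARGV[3]
-- A missing lock reads as false, which never equals the token.
if redis.call('GET', lockKey) ~= token then
  return 0
end
redis.call('LREM', activeKey, 0, jobId)
redis.call('ZADD', completedKey, now, jobId)
redis.call('DEL', lockKey)
redis.call('XADD', eventsKey, '*', 'event', 'completed', 'jobId', jobId)
return 1
`;

// `redis` is assumed to be an ioredis-style client.
async function completeJob(redis, queue, jobId, workerToken, now = Date.now()) {
  const base = `bull:${queue}`;
  const result = await redis.eval(
    COMPLETE_JOB_LUA,
    4, // number of KEYS that follow
    `${base}:${jobId}:lock`, `${base}:active`, `${base}:completed`, `${base}:events`,
    jobId, workerToken, now,
  );
  return result === 1; // false means the lock was lost or held by another token
}
```

Running everything in one script is what makes the lock check and the state change atomic: no other client can act between the GET and the LREM.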

Job Failure

javascript
// Worker signals failure (called for each failed attempt; the script decides whether to retry)
await job.moveToFailed(error, workerToken);

Similar Lua script:

  1. Verify lock
  2. LREM bull:emails:active 0 {jobId}
  3. If retries remain: LPUSH bull:emails:wait {jobId} (or with backoff delay: ZADD bull:emails:delayed ...)
  4. If no retries remain: ZADD bull:emails:failed {timestamp} {jobId}
  5. Update job Hash with failedReason, stacktrace, attemptsMade
  6. Release lock, emit event
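
The branching in steps 3 and 4 can be written as a pure decision function. This is a sketch under stated assumptions: the field names attemptsMade, attempts, and backoffMs are illustrative, a fixed backoff is assumed, and failureTransition is a hypothetical name rather than anything BullMQ exports.

```javascript
// Pure decision function: where does a job go after one failed attempt?
// `attempts` is the total number of tries allowed for the job.
function failureTransition(job, now = Date.now()) {
  const attemptsMade = job.attemptsMade + 1; // this failure counts as an attempt
  if (attemptsMade < job.attempts) {
    return job.backoffMs > 0
      ? { attemptsMade, target: 'delayed', score: now + job.backoffMs }
      : { attemptsMade, target: 'wait' };
  }
  return { attemptsMade, target: 'failed', score: now };
}

console.log(failureTransition({ attemptsMade: 0, attempts: 3, backoffMs: 0 }, 1000));
// { attemptsMade: 1, target: 'wait' }
console.log(failureTransition({ attemptsMade: 0, attempts: 3, backoffMs: 5000 }, 1000));
// { attemptsMade: 1, target: 'delayed', score: 6000 }
console.log(failureTransition({ attemptsMade: 2, attempts: 3, backoffMs: 5000 }, 1000));
// { attemptsMade: 3, target: 'failed', score: 1000 }
```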

Stalled Job Detection

A job becomes "stalled" when the worker crashes (SIGKILL, OOM) after moving the job to active but before completing or failing it. The lock expires but no worker claims the job — it is stuck in active indefinitely.

The stall check runs periodically (configurable with stalledInterval, default 30 seconds):

javascript
// Worker's internal stall check (run by QueueScheduler in BullMQ 1.x, by the Worker itself since 2.0)
// Checks all jobs in 'active' that have an expired lock

The Lua-based stall check:

  1. Scans bull:emails:active for job IDs
  2. For each: checks if bull:emails:{jobId}:lock exists
  3. If the lock does not exist (expired): the job is stalled
  4. If the job has stalled no more than maxStalledCount times: moves it back to wait (retry)
  5. If it has exceeded maxStalledCount: moves it to failed
javascript
// Configure stall detection
const worker = new Worker('emails', processor, {
  stalledInterval: 30000, // check every 30 seconds
  maxStalledCount: 1,     // recover a stalled job once; fail it on the next stall
  lockDuration: 30000,    // lock expires in 30 seconds
  lockRenewTime: 15000,   // renew lock every 15 seconds
});
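
One pass of the stall check can be modelled in memory. The sketch below assumes that the limit on recoveries is the worker's maxStalledCount option and that a per-job stall counter is kept between passes; stallCheck is a hypothetical function, and the real check runs as a Lua script inside Redis.

```javascript
// In-memory model of one stall-check pass.
// `locks` holds the job IDs whose lock key still exists;
// `stalledCounts` remembers how many times each job has stalled so far.
function stallCheck(active, locks, stalledCounts, maxStalledCount = 1) {
  const result = { backToWait: [], failed: [] };
  for (const jobId of active) {
    if (locks.has(jobId)) continue; // lock alive: the worker is still renewing it
    const count = (stalledCounts.get(jobId) || 0) + 1;
    stalledCounts.set(jobId, count);
    if (count > maxStalledCount) result.failed.push(jobId);
    else result.backToWait.push(jobId);
  }
  return result;
}

const counts = new Map([['3', 1]]); // job 3 has already stalled once
console.log(stallCheck(['1', '2', '3'], new Set(['1']), counts));
// { backToWait: [ '2' ], failed: [ '3' ] }
```

Job 1 still has a lock and is left alone, job 2 stalls for the first time and is retried, and job 3 exceeds the limit and is failed.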

Tuning stall detection:

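  • lockDuration must outlast the longest stretch in which the worker cannot renew its lock (for example, a CPU-bound step that blocks the event loop); setting it above the maximum expected job processing time is the conservative choice
  • lockRenewTime defaults to lockDuration / 2, so the worker renews its lock halfway through the duration
  • If a job legitimately takes 5 minutes: set lockDuration: 360000 (6 minutes)
  • maxStalledCount: 0 means a job is moved to failed the first time it stalls; a very high value lets a job that keeps crashing its worker be retried almost indefinitely (dangerous for crash loops)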

Rate Limiter Internals

javascript
const worker = new Worker('emails', processor, {
  limiter: {
    max: 100,
    duration: 1000, // 100 jobs per second
  },
});

In current BullMQ releases the rate limiter is a fixed-window counter held in a String key with a TTL, not a Sorted Set:

bull:emails:limiter → String: number of jobs started in the current window (TTL = duration)

When a worker asks for the next job, the script:

  1. Reads the counter; if it has already reached max, no job is started and the script returns the key's remaining TTL (PTTL)
  2. The worker waits for that TTL to elapse while the job stays in wait
  3. Otherwise it increments the counter (INCR) and, on the first increment of a window, sets the key to expire after duration ms (PEXPIRE)
  4. The job then moves to active and is processed
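
A per-window limit can be pictured as a counter with an expiry, which is the pattern INCR plus PEXPIRE gives you on a single key. The sketch below is an in-memory model of that idea (a fixed window, not a sliding one). WindowLimiter is a hypothetical class written for illustration and is not BullMQ's script.

```javascript
// Illustrative fixed-window limiter: the counter resets once the window,
// which starts at the first job, has lasted `duration` ms.
class WindowLimiter {
  constructor(max, duration) {
    this.max = max;
    this.duration = duration;
    this.count = 0;
    this.expiresAt = 0;
  }

  // Returns 0 if a job may start now, otherwise the milliseconds to wait.
  tryAcquire(now) {
    if (now >= this.expiresAt) { // key expired: start a new window
      this.count = 0;
      this.expiresAt = now + this.duration;
    }
    if (this.count >= this.max) return this.expiresAt - now; // like PTTL
    this.count += 1; // like INCR
    return 0;
  }
}

const limiter = new WindowLimiter(2, 1000);
console.log(limiter.tryAcquire(0));    // 0
console.log(limiter.tryAcquire(10));   // 0
console.log(limiter.tryAcquire(20));   // 980
console.log(limiter.tryAcquire(1000)); // 0
```

The third call is refused with the remaining window time, and the fourth succeeds because the window has expired and a new one begins.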

Querying Queue State Directly

With this knowledge, you can inspect BullMQ queues using raw Redis commands:

bash
# How many jobs are waiting?
redis-cli LLEN bull:emails:wait

# How many jobs are active?
redis-cli LLEN bull:emails:active

# What jobs are active? (get their IDs)
redis-cli LRANGE bull:emails:active 0 -1

# Get details of a specific job
redis-cli HGETALL bull:emails:42

# What delayed jobs are coming up in the next 60 seconds?
# (date +%s%3N needs GNU date; this assumes the score is the raw run-at timestamp in ms,
#  so compare against a WITHSCORES sample first, since some BullMQ versions scale the score)
redis-cli ZRANGEBYSCORE bull:emails:delayed 0 $(($(date +%s%3N) + 60000)) WITHSCORES

# How many failed jobs?
redis-cli ZCARD bull:emails:failed

# View the events stream
redis-cli XREVRANGE bull:emails:events + - COUNT 10
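
The same counts can be collected from a script. This sketch assumes redis is an ioredis-style client exposing llen() and zcard(); queueSnapshot is a hypothetical helper. The five reads are not atomic, so treat the numbers as approximate on a busy queue.

```javascript
// Sketch: `redis` is assumed to be an ioredis-style client.
// Collects the size of each queue state for one queue.
async function queueSnapshot(redis, queue) {
  const base = `bull:${queue}`;
  const [waiting, active, delayed, completed, failed] = await Promise.all([
    redis.llen(`${base}:wait`),       // List
    redis.llen(`${base}:active`),     // List
    redis.zcard(`${base}:delayed`),   // Sorted Set
    redis.zcard(`${base}:completed`), // Sorted Set
    redis.zcard(`${base}:failed`),    // Sorted Set
  ]);
  return { waiting, active, delayed, completed, failed };
}
```

A growing active count alongside a flat completed count is a quick signal that jobs are stuck or stalling.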

Memory Considerations

For high-throughput queues, BullMQ keys accumulate:

  • Completed jobs: bull:emails:{jobId} Hashes persist after completion unless removeOnComplete is set
  • Failed jobs: the same applies; Hashes persist indefinitely unless removeOnFail is set
javascript
// Recommended: auto-remove jobs after a count or age
const worker = new Worker('emails', processor, {
  removeOnComplete: { count: 1000 }, // keep last 1000 completed
  removeOnFail: { count: 500 },      // keep last 500 failed
});

Without this, a queue processing 1,000 jobs/hour generates 24,000 job Hashes per day. Each Hash is ~300–500 bytes, so 1M retained jobs (about six weeks at this rate) take roughly 300–500 MB for the job Hashes alone.

The completed and failed Sorted Sets also grow unboundedly. removeOnComplete.count limits the Sorted Set size by trimming (ZREMRANGEBYRANK) after each completion.
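
The arithmetic is easy to rerun for your own workload. This is a back-of-envelope sketch: estimateJobHashBytes is a hypothetical helper, the per-job byte figure is whatever you measure or assume, and the result covers the job Hashes only.

```javascript
// Back-of-envelope estimate of memory held by retained job Hashes.
function estimateJobHashBytes(jobsPerHour, days, bytesPerJob) {
  const jobs = jobsPerHour * 24 * days;
  return { jobs, bytes: jobs * bytesPerJob };
}

const perDay = estimateJobHashBytes(1000, 1, 400);
console.log(perDay.jobs);        // 24000
console.log(perDay.bytes / 1e6); // 9.6 (MB per day at 400 bytes per job)
```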


Summary

  • BullMQ uses wait (List) for FIFO queuing, active (List) for in-progress jobs, completed/failed (Sorted Sets) for history, delayed (Sorted Set with timestamp score) for scheduling
  • Job data lives in a Hash bull:{queue}:{jobId}; queues store only the ID
  • Workers use LMOVE wait active (atomic) to claim jobs; a per-job lock key with a TTL prevents double-processing
  • Stalled jobs (lock expired, still in active) are detected and retried or failed by the stall checker
  • Tune lockDuration to exceed max job processing time; lockRenewTime defaults to half lockDuration
  • Rate limiting uses a per-window counter key with a TTL; when the window is full, workers wait for the key to expire before taking more jobs
  • Enable removeOnComplete and removeOnFail to prevent unbounded memory growth
  • Query queue state directly with Redis commands for debugging without the BullMQ API overhead

Next: P-7 — Cache Stampede, Avalanche, and Penetration — three cache failure modes that look similar in monitoring but require different solutions.

© 2026 Jatin Jain Saraf (JJS). All rights reserved.