
Task Queuing

Master async job management — job queues vs message queues, priority queues, and dead-letter queues. Decouple task submission from execution for reliable background processing.

01

The Big Picture — Why Queue Tasks?

A user uploads a video. Processing it (transcoding, thumbnail generation, content moderation) takes 3 minutes. You can't make the user wait 3 minutes for an HTTP response. You can't run it inside the request handler — it would time out. The solution: accept the upload, put a "process this video" task on a queue, return immediately, and let a background worker handle the heavy lifting.

🍽️

The Restaurant Order Analogy

You walk up to the counter and place an order. The cashier doesn't cook your food right there — they write a ticket (task), put it on the order rail (queue), and hand you a receipt with an order number (job ID). You sit down and wait. In the kitchen, cooks (workers) pick up tickets and prepare food. When your order is ready, your number is called. The cashier (API) is free to serve the next customer immediately. The kitchen (workers) processes orders at its own pace. If the kitchen is backed up, orders queue up — but no customer is blocked at the counter.

🔥 Key Insight

Task queuing decouples submission from execution. The API accepts the request in milliseconds and returns a job ID. The actual work happens asynchronously in a worker process. This keeps the API fast, prevents timeouts, and lets you scale workers independently based on workload.

02

Task Queuing Overview

Task Queuing Architecture

Client → API → Queue → Worker(s) → Result Store

1. Client: "Process this video" (POST /uploads)
2. API: creates task, enqueues it, returns job_id
   Response: { "job_id": "job-abc-123", "status": "queued" }
Client gets response in ~50ms

3. Queue: holds the task until a worker picks it up
   Tasks: [job-abc-123, job-def-456, job-ghi-789, ...]

4. Worker: dequeues task, processes it (3 minutes)
Transcode video, generate thumbnails, run moderation

5. Result Store: worker writes result
Update DB: job-abc-123 status = "completed", output_url = "..."

6. Client polls or gets notified:
   GET /jobs/job-abc-123 → { "status": "completed", "url": "..." }
   OR: WebSocket/push notification when done

Async Processing

The API doesn't wait for the task to finish. It returns immediately with a job ID. The client checks back later or gets a notification. No timeouts, no blocking.
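This decoupling can be sketched with an in-memory queue and a single worker thread. The names (`submit_job`, `result_store`) are illustrative; a production system would back these with Redis, SQS, or a job-queue library:

```python
import queue
import threading
import time
import uuid

# In-memory stand-ins for the queue and result store.
task_queue: "queue.Queue[dict]" = queue.Queue()
result_store: dict = {}

def submit_job(payload: dict) -> str:
    """API handler: enqueue the task, return a job ID in milliseconds."""
    job_id = f"job-{uuid.uuid4().hex[:8]}"
    result_store[job_id] = {"status": "queued"}
    task_queue.put({"job_id": job_id, "payload": payload})
    return job_id

def worker() -> None:
    """Background worker: dequeues and processes at its own pace."""
    while True:
        task = task_queue.get()
        result_store[task["job_id"]]["status"] = "running"
        time.sleep(0.01)  # stand-in for minutes of transcoding
        result_store[task["job_id"]] = {"status": "completed", "url": "..."}
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

job_id = submit_job({"video_id": "vid-42"})
# A real client would poll GET /jobs/{job_id}; here we just wait.
task_queue.join()
print(result_store[job_id])  # {'status': 'completed', 'url': '...'}
```

The key point is that `submit_job` returns before any work happens; the worker and the API only communicate through the queue and the result store.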

📈

Load Smoothing

100 video uploads arrive in 1 second. 5 workers process 1 video/minute. The queue holds 100 tasks; workers drain them over 20 minutes. No spike overwhelms the system.

03

Job Queues vs Message Queues

Job Queues — "Do This Work"

A job queue is designed for task execution. You submit a job (a unit of work), a worker picks it up, executes it, and reports the result. The queue tracks job state: queued → running → completed/failed.

Job Queue — Features
Job: {
  id: "job-abc-123",
  type: "transcode_video",
  payload: { video_id: "vid-42", format: "720p" },
  status: "queued",        ← tracked by the queue
  retries: 0,              ← auto-retry on failure
  max_retries: 3,
  scheduled_at: "2024-06-15T10:00:00Z",  ← delayed execution
  created_at: "2024-06-15T09:55:00Z"
}

Lifecycle:
  queued → running → completed
  failed → retry → running → completed
  failed (max retries) → dead ❌ (→ DLQ)

Features:
Status tracking (query job state anytime)
Retry with backoff (automatic on failure)
Scheduling (run at a specific time)
Priority (urgent jobs first)
Timeout (kill jobs that run too long)

Examples: Sidekiq (Ruby), Celery (Python), BullMQ (Node.js), Temporal
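The job record above can be expressed as a small dataclass. This is my own sketch, not any particular library's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Minimal job record mirroring the fields shown above (illustrative).
@dataclass
class Job:
    id: str
    type: str
    payload: dict
    status: str = "queued"            # queued -> running -> completed/failed
    retries: int = 0
    max_retries: int = 3
    scheduled_at: Optional[datetime] = None   # None = run as soon as possible
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    def can_retry(self) -> bool:
        return self.retries < self.max_retries

job = Job(id="job-abc-123", type="transcode_video",
          payload={"video_id": "vid-42", "format": "720p"})
print(job.status, job.can_retry())  # queued True
```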

Message Queues — "Something Happened"

A message queue is designed for communication between services. A producer publishes a message (an event or command), one or more consumers receive it. The queue doesn't track what happens after delivery — it just ensures the message is delivered.

Message Queue — Features
Message: {
  topic: "order-events",
  key: "user-42",
  value: { "event": "OrderPlaced", "order_id": "ord-123", "total": 99.99 },
  timestamp: "2024-06-15T10:00:00Z"
}

Lifecycle:
  published → delivered → acknowledged
  (no status tracking, no retry management by the queue itself)

Features:
Pub/Sub (one message → many consumers)
Ordering (within a partition)
Replay (consumers can re-read old messages)
High throughput (millions of messages/sec)
No job status tracking
No built-in retry/scheduling (consumer handles this)

Examples: Kafka, SQS, RabbitMQ, Redis Streams
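The pub/sub fan-out behavior can be sketched with a toy in-memory broker. The `Topic` class is hypothetical; Kafka, SQS, and RabbitMQ provide this for real:

```python
import queue

# Toy pub/sub broker: each subscriber gets its own copy of every message.
class Topic:
    def __init__(self) -> None:
        self.subscribers: list = []

    def subscribe(self) -> "queue.Queue[dict]":
        q: "queue.Queue[dict]" = queue.Queue()
        self.subscribers.append(q)
        return q

    def publish(self, message: dict) -> None:
        for q in self.subscribers:      # one message -> many consumers
            q.put(message)

orders = Topic()
payments = orders.subscribe()     # payment service's subscription
inventory = orders.subscribe()    # inventory service's subscription

orders.publish({"event": "OrderPlaced", "order_id": "ord-123"})
got_payment = payments.get()
got_inventory = inventory.get()
print(got_payment["event"], got_inventory["event"])  # OrderPlaced OrderPlaced
```

Note what is missing compared with a job queue: the topic never tracks whether either consumer succeeded; once delivered, the message's fate is the consumer's problem.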
| Dimension | Job Queue | Message Queue |
| --- | --- | --- |
| Purpose | Execute a task | Deliver a message/event |
| Semantics | "Do this work" | "Something happened" |
| Status tracking | Yes (queued, running, done, failed) | No (delivered or not) |
| Retry logic | Built-in (configurable) | Consumer must implement |
| Scheduling | Yes (run at specific time) | No (immediate delivery) |
| Fan-out | One worker per job | Multiple consumers per message |
| Throughput | Moderate (task overhead) | Very high (lightweight messages) |
| Examples | Sidekiq, Celery, BullMQ | Kafka, SQS, RabbitMQ |
| Use case | Video processing, email, reports | Event streaming, service communication |

🎯 Interview Insight

Job queue = "do work." Message queue = "notify others." Use a job queue when you need to execute a task and track its progress (video transcoding, report generation). Use a message queue when you need to broadcast an event to multiple services (OrderPlaced → payment, inventory, notification). Many systems use both.

04

Priority Queues

Not all tasks are equal. A password reset email should be sent in seconds. A weekly analytics report can wait hours. Priority queues ensure urgent tasks are processed before less important ones.

Priority Queue — How It Works
Three priority levels:

HIGH priority queue:   [password-reset, payment-alert, security-warning]
MEDIUM priority queue: [order-confirmation, shipping-update]
LOW priority queue:    [weekly-report, data-export, cleanup-job]

Worker behavior:
  1. Check HIGH queue → any tasks? → process
  2. If HIGH is empty → check MEDIUM → process
  3. If MEDIUM is empty → check LOW → process

Result:
  Password reset email: sent in 2 seconds
  Order confirmation: sent in 30 seconds
  Weekly report: generated in 2 hours

Implementation options:
  Option A: Separate queues per priority (simplest)
    Workers poll HIGH first, then MEDIUM, then LOW

  Option B: Single queue with priority field (heap-based)
    Queue sorts by priority internally
    Higher priority tasks dequeued first

  Option C: Weighted fair queuing
    HIGH gets 60% of worker capacity
    MEDIUM gets 30%
    LOW gets 10%
Prevents starvation of low-priority tasks
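Option B can be sketched with the standard-library heap. Lower number means more urgent; a monotonic counter preserves FIFO order within each priority level:

```python
import heapq
import itertools

# Single heap-ordered queue keyed by (priority, arrival order).
HIGH, MEDIUM, LOW = 0, 1, 2
counter = itertools.count()
heap: list = []

def enqueue(task: str, priority: int) -> None:
    heapq.heappush(heap, (priority, next(counter), task))

def dequeue() -> str:
    return heapq.heappop(heap)[2]   # highest priority first, oldest first

enqueue("weekly-report", LOW)
enqueue("password-reset", HIGH)
enqueue("order-confirmation", MEDIUM)

order = [dequeue(), dequeue(), dequeue()]
print(order)  # ['password-reset', 'order-confirmation', 'weekly-report']
```

The password reset jumps the queue even though it was enqueued after the weekly report, which is exactly the behavior the three-level example above describes.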

Benefits

  • Critical tasks processed first (password resets, alerts)
  • Better user experience (urgent actions feel instant)
  • Resource allocation matches business importance
  • Simple to implement with separate queues

Challenges

  • Starvation: low-priority tasks never run if high-priority is always full
  • Priority inversion: low-priority task holds a resource high-priority needs
  • Complexity: managing multiple queues and worker allocation
  • Fairness: must ensure all priorities eventually get processed

🎯 Interview Insight

When designing a system with mixed task urgency, always mention priority queues. Say: "I'd use separate queues for high, medium, and low priority. Workers check high first. To prevent starvation, I'd use weighted fair queuing — 60% capacity for high, 30% medium, 10% low. This ensures urgent tasks are fast without starving batch jobs."
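A weighted fair queuing dispatcher along the lines of that 60/30/10 split can be sketched as weighted random selection over non-empty queues. This is illustrative; production systems often use deficit round-robin instead:

```python
import random
from collections import deque

# One deque per priority level, plus a weight per level.
queues = {"HIGH": deque(), "MEDIUM": deque(), "LOW": deque()}
weights = {"HIGH": 6, "MEDIUM": 3, "LOW": 1}   # ~60% / 30% / 10% of capacity

def pick_task(rng: random.Random):
    # Only consider queues that currently hold tasks, so low-priority
    # work is processed whenever higher levels are idle (no starvation).
    non_empty = {name: w for name, w in weights.items() if queues[name]}
    if not non_empty:
        return None
    name = rng.choices(list(non_empty), weights=list(non_empty.values()))[0]
    return queues[name].popleft()

rng = random.Random(42)
for i in range(10):
    queues["HIGH"].append(f"urgent-{i}")
    queues["LOW"].append(f"batch-{i}")

picked = [pick_task(rng) for _ in range(20)]
# All 20 tasks are drained; urgent ones tend to come out earlier.
```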

05

Dead-Letter Queues (DLQ)

A dead-letter queue is where failed messages go to die — or rather, to wait for investigation. When a task fails repeatedly (after all retries are exhausted), it's moved to the DLQ instead of being retried forever or silently dropped.

Dead-Letter Queue — How It Works
Task: "Send welcome email to user-42"

Attempt 1: Email service is down → FAILED → retry in 10s
Attempt 2: Email service still down → FAILED → retry in 30s
Attempt 3: Email service returns 500 → FAILED → retry in 60s
Attempt 4 (max retries reached): FAILED → move to DLQ

Main Queue:  [task-1, task-2, task-3, ...]  ← healthy tasks
Dead-Letter: [failed-task-A, failed-task-B]  ← failed tasks

DLQ contains:
  {
    original_task: { type: "send_email", user_id: 42 },
    failure_reason: "Email service returned 500",
    attempts: 4,
    first_failed_at: "2024-06-15T10:00:00Z",
    last_failed_at: "2024-06-15T10:01:30Z"
  }

What happens next:
  1. Alert: monitoring detects DLQ depth > 0 → page on-call engineer
  2. Investigate: engineer checks failure reason → email service was down
  3. Fix: email service is restored
  4. Replay: move tasks from DLQ back to main queue → processed successfully
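The retry-then-DLQ flow above can be sketched in a few lines. This is an in-memory illustration (backoff delays omitted); a real system persists both queues:

```python
import queue

main_queue: "queue.Queue[dict]" = queue.Queue()
dead_letter: list = []
MAX_RETRIES = 3

def always_failing_send(task: dict) -> None:
    raise RuntimeError("Email service returned 500")  # simulated outage

def process_one() -> None:
    task = main_queue.get()
    try:
        always_failing_send(task)
    except Exception as exc:
        task["attempts"] = task.get("attempts", 0) + 1
        if task["attempts"] > MAX_RETRIES:
            task["failure_reason"] = str(exc)
            dead_letter.append(task)   # preserved for inspection and replay
        else:
            main_queue.put(task)       # retry (backoff delay omitted here)

main_queue.put({"type": "send_email", "user_id": 42})
while not main_queue.empty():
    process_one()

print(len(dead_letter), dead_letter[0]["attempts"])  # 1 4
```

Attempts 1 through 3 re-queue the task; the fourth failure exceeds `MAX_RETRIES` and lands it in the DLQ with its failure reason attached, matching the timeline above.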
🛡️

Fault Isolation

Poison messages (malformed, invalid) don't block the queue. They're moved to DLQ after max retries. Healthy messages continue processing.

🔍

Debugging

DLQ preserves the failed message with its error context. Engineers can inspect why it failed, fix the issue, and replay the message.

💾

No Silent Data Loss

Without DLQ: failed messages are dropped after max retries. With DLQ: they're preserved for investigation and replay. No data is silently lost.

When to use DLQ

  • Any production queue system (it's a best practice, not optional)
  • Payment processing (failed charges must be investigated)
  • Event processing (failed events can't be silently dropped)
  • Email/notification sending (failed sends need retry after fix)

DLQ operations

  • Monitor: alert when DLQ depth > 0
  • Inspect: view failed messages and error reasons
  • Replay: move messages back to main queue after fixing the issue
  • Purge: delete messages that are no longer relevant
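The replay operation can be sketched as moving entries back onto the main queue with a fresh retry budget (my own in-memory illustration; queue products expose this as a "redrive" or replay feature):

```python
import queue

def replay_dlq(dlq: list, main_queue: "queue.Queue[dict]") -> int:
    """Move every DLQ entry back onto the main queue; return the count."""
    moved = 0
    while dlq:
        entry = dlq.pop()
        task = dict(entry["original_task"], attempts=0)  # reset retry budget
        main_queue.put(task)
        moved += 1
    return moved

dlq = [{"original_task": {"type": "send_email", "user_id": 42},
        "failure_reason": "Email service returned 500", "attempts": 4}]
q: "queue.Queue[dict]" = queue.Queue()
moved = replay_dlq(dlq, q)
print(moved, q.qsize())  # 1 1
```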

🎯 Interview Insight

Always mention DLQ when discussing queues. It shows you think about failure handling. Say: "After 3 retries with exponential backoff, failed tasks move to a dead-letter queue. We monitor DLQ depth and alert on-call. Engineers investigate, fix the root cause, and replay the failed tasks. No message is silently lost."

06

End-to-End Scenario

Let's design the background job system for a video platform — where every upload triggers multiple long-running tasks.

🎬 Video Platform — 50K Uploads/Day

Each upload triggers: transcoding (3 formats), thumbnail generation, content moderation, notification.

Transcoding takes 2-10 minutes. Thumbnails take 30 seconds. Moderation takes 1 minute.

1

Upload API → Job Queue (immediate response)

User uploads video. API stores the file in S3, creates a 'process_video' job in the job queue, and returns immediately: { job_id: 'job-123', status: 'queued' }. User sees 'Processing your video...' The API response takes 200ms, not 10 minutes.

2

Priority queues → Urgent vs batch

Paid users' videos go to the HIGH priority queue (processed in minutes). Free users go to LOW priority (processed in hours during off-peak). Content moderation flagged as CRITICAL goes to a dedicated queue with its own workers — never delayed by transcoding backlog.

3

Workers process tasks

Transcoding workers (10 instances) pick up jobs from the queue. Each worker: dequeue → download video from S3 → transcode to 720p/1080p/4K → upload results to S3 → update job status to 'completed.' If a worker crashes mid-transcode, the job times out and is re-queued automatically.
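The "times out and is re-queued" behavior in step 3 is typically implemented with a visibility timeout: a dequeued job becomes invisible rather than deleted, and reappears if the worker never acknowledges it. A single-process sketch, with an artificially short timeout:

```python
import time

VISIBILITY_TIMEOUT = 0.05  # seconds; real systems use minutes

pending: list = [{"id": "job-123"}]
in_flight: dict = {}       # job id -> deadline for acknowledgement

def requeue_expired() -> None:
    now = time.monotonic()
    for job_id, deadline in list(in_flight.items()):
        if now > deadline:             # worker presumed crashed
            del in_flight[job_id]
            pending.append({"id": job_id})

def dequeue():
    requeue_expired()
    if not pending:
        return None
    job = pending.pop(0)
    in_flight[job["id"]] = time.monotonic() + VISIBILITY_TIMEOUT
    return job

def ack(job: dict) -> None:
    in_flight.pop(job["id"], None)     # success: job is gone for good

job = dequeue()          # worker takes the job...
time.sleep(0.06)         # ...then "crashes" without calling ack(job)
recovered = dequeue()    # the same job comes back after the timeout
print(recovered["id"])   # job-123
```

A worker that finishes normally calls `ack`, deleting the job; a worker that dies mid-transcode simply never acks, and the timeout returns the job to the queue.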

4

Failures → Retry → DLQ

Transcoding fails (corrupt video file). Retry 1: same error. Retry 2: same error. Retry 3: same error. Job moves to DLQ. Alert fires. Engineer inspects: 'corrupt file, can't transcode.' Notifies user: 'Your video could not be processed. Please re-upload.' DLQ entry is marked as resolved.

5

Completion → Notify user

All tasks complete. Job status: 'completed.' Push notification sent to user: 'Your video is ready!' Or: client polls GET /jobs/job-123 and sees status change. Video page now shows all formats and thumbnails.

Architecture — Video Processing Pipeline
Upload API

  ├── Store video in S3
  ├── Create job in queue
  └── Return job_id to client (200ms)

Job Queue (BullMQ / Celery):
  HIGH:     [paid-user-videos]
  MEDIUM:   [free-user-videos]
  CRITICAL: [content-moderation]

Workers:
  Transcode workers (10x):  HIGH → MEDIUM queue
  Thumbnail workers (5x):   all priorities
  Moderation workers (3x):  CRITICAL queue only

Failure handling:
  Retry: 3 attempts, exponential backoff (10s, 30s, 90s)
  DLQ: after 3 failures → dead-letter queue
  Alert: DLQ depth > 0 → page on-call

Status tracking:
  GET /jobs/job-123 → { status: "running", progress: 65% }
Client shows progress bar
On completion: push notification + status update
07

Trade-offs & Decision Making

| Dimension | Job Queue | Message Queue |
| --- | --- | --- |
| Use when | You need to execute work and track it | You need to notify services about events |
| Status tracking | Built-in (queued/running/done/failed) | None (consumer tracks its own state) |
| Retry | Built-in with backoff | Consumer implements retry |
| Fan-out | One worker per job | Multiple consumers per message |
| Throughput | Moderate (task overhead) | Very high (lightweight) |
| Best for | Video processing, emails, reports | Event streaming, service communication |

Priority vs Fairness

| Strategy | Urgent Tasks | Batch Tasks | Starvation Risk | Complexity |
| --- | --- | --- | --- | --- |
| Single queue (FIFO) | Wait in line | Wait in line | None | Lowest |
| Separate priority queues | Processed first | Processed last | High (low never runs) | Low |
| Weighted fair queuing | 60% capacity | 10% capacity | Low (guaranteed minimum) | Medium |
| Dedicated workers per priority | Own workers | Own workers | None (independent) | Higher |

🎯 Decision Framework

Start with a single FIFO queue. When you need urgency differentiation, add priority queues (separate queues, workers check high first). When low-priority starvation becomes a problem, switch to weighted fair queuing. Always have a DLQ — it's not optional, it's a reliability requirement.

08

Interview Questions

Q:Job queue vs message queue — what's the difference?

A: A job queue is for task execution: submit a job, a worker executes it, the queue tracks status (queued/running/done/failed), retries on failure, supports scheduling and priority. Think: 'do this work.' A message queue is for communication: publish an event, one or more consumers receive it, the queue ensures delivery but doesn't track what happens after. Think: 'something happened.' Use job queues for background tasks (video processing, emails). Use message queues for event-driven communication between services.

Q:Why use priority queues?

A: When tasks have different urgency levels. A password reset email must be sent in seconds — it can't wait behind 10,000 batch report jobs. Priority queues ensure critical tasks are processed first. Implementation: separate queues per priority level, workers check high-priority first. To prevent starvation of low-priority tasks, use weighted fair queuing — allocate a guaranteed percentage of worker capacity to each priority level.

Q:What is a dead-letter queue?

A: A DLQ stores messages/jobs that failed after all retry attempts. Instead of dropping them (data loss) or retrying forever (infinite loop), they're moved to a separate queue for investigation. Benefits: (1) fault isolation — poison messages don't block the main queue, (2) debugging — failed messages are preserved with error context, (3) replay — after fixing the issue, move messages back to the main queue. Every production queue system should have a DLQ. Monitor its depth and alert when it's non-empty.

1

Your email sending service processes 100K emails/day but some fail

How do you handle failures reliably?

Answer: Use a job queue with retry and DLQ. On failure: retry 3 times with exponential backoff (10s, 30s, 90s). If still failing after 3 attempts, move to DLQ. Monitor DLQ depth — alert if > 0. Common failure causes: invalid email address (permanent, don't retry), email service down (temporary, retry works), rate limited (temporary, retry with longer backoff). For permanent failures, mark as failed and notify the user. For temporary failures, the retry mechanism handles it automatically.

2

Paid users complain their video processing takes too long

How do you prioritize paid users?

Answer: Separate priority queues. Paid users → HIGH queue, free users → LOW queue. Transcoding workers check HIGH first — paid videos are processed within minutes. Free users are processed during off-peak hours or when HIGH is empty. To prevent free users from never being processed: use weighted fair queuing — 70% capacity for HIGH, 30% for LOW. This guarantees paid users are fast while ensuring free users still get processed within a reasonable time.

09

Pitfalls

🔁

Not handling retries properly

Retrying immediately on failure, or retrying with the same interval. The downstream service is overloaded — hammering it with retries makes it worse. Or: no retry at all — a transient network error causes permanent task failure.

Use exponential backoff with jitter: retry after 1s, 2s, 4s, 8s (+ random jitter to prevent thundering herd). Set a max retry count (3-5). After max retries, move to DLQ. Distinguish between retryable errors (timeout, 503) and permanent errors (400, invalid input) — don't retry permanent errors.
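That schedule can be sketched as a pure function: base delay doubling per attempt, capped, with "full jitter" (the delay is drawn uniformly from zero up to the backoff value). Delays are computed rather than slept, for illustration:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base*2^n)]."""
    return rng.uniform(0, min(cap, base * (2 ** attempt)))

RETRYABLE = {408, 429, 500, 502, 503, 504}   # transient HTTP-style errors

def should_retry(status: int, attempt: int, max_retries: int = 4) -> bool:
    # Permanent errors (e.g. 400 invalid input) are never retried.
    return status in RETRYABLE and attempt < max_retries

delays = [backoff_delay(a) for a in range(4)]  # within [0,1], [0,2], [0,4], [0,8]
print(should_retry(503, 0), should_retry(400, 0))  # True False
```

The jitter is what prevents the thundering herd: a thousand tasks that failed at the same instant retry at a thousand different instants instead of in lockstep.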

♾️

Infinite retry loops

A malformed message causes the worker to crash every time it's processed. Without a max retry limit, the message is retried forever — consuming worker capacity and never succeeding. The queue appears healthy (messages are being processed) but nothing is actually completing.

Always set max_retries (3-5 is typical). After max retries, move to DLQ. Log the failure reason with each attempt. Monitor: if the same job has been retried 3 times, it's likely a permanent issue — stop retrying and investigate.

🗑️

Ignoring DLQ

Setting up a DLQ but never monitoring it. Failed messages accumulate silently. Customers never receive their emails, videos are never processed, payments are never completed — and nobody knows until customers complain.

Monitor DLQ depth as a critical metric. Alert immediately when DLQ depth > 0. Build a DLQ dashboard: show failed messages, error reasons, timestamps. Build a replay mechanism: one-click to move messages back to the main queue after fixing the issue. Review DLQ daily as part of operations.

🏋️

Overloading workers

Running 100 concurrent tasks on a worker with 4 CPU cores. Each task uses 100% of a core for transcoding. Workers thrash, tasks take 10x longer, memory runs out, workers crash — creating more retries and more load.

Match worker concurrency to available resources. CPU-bound tasks (transcoding): 1 task per core. I/O-bound tasks (API calls, emails): 10-50 concurrent tasks per worker. Monitor worker CPU, memory, and task duration. Auto-scale workers based on queue depth — add workers when the queue grows, remove when it shrinks.
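A concurrency cap is easy to enforce with a semaphore: at most N tasks in flight per worker process, e.g. N = number of cores for CPU-bound work. This is a sketch; real job libraries expose it as a "concurrency" setting:

```python
import threading
import time

MAX_CONCURRENT = 4                     # e.g. one transcode per CPU core
slots = threading.BoundedSemaphore(MAX_CONCURRENT)
state = {"running": 0, "peak": 0}
lock = threading.Lock()

def handle_task(_task_id: int) -> None:
    with slots:                        # blocks while 4 tasks are running
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(0.01)               # stand-in for the actual work
        with lock:
            state["running"] -= 1

threads = [threading.Thread(target=handle_task, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(state["peak"])  # never exceeds MAX_CONCURRENT
```

Twenty tasks arrive at once, but the semaphore admits only four at a time; the rest queue up instead of thrashing the CPU.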