Task Queuing
Master async job management — job queues vs message queues, priority queues, and dead-letter queues. Decouple task submission from execution for reliable background processing.
The Big Picture — Why Queue Tasks?
A user uploads a video. Processing it (transcoding, thumbnail generation, content moderation) takes 3 minutes. You can't make the user wait 3 minutes for an HTTP response. You can't run it inside the request handler — it would time out. The solution: accept the upload, put a "process this video" task on a queue, return immediately, and let a background worker handle the heavy lifting.
The Restaurant Order Analogy
You walk up to the counter and place an order. The cashier doesn't cook your food right there — they write a ticket (task), put it on the order rail (queue), and hand you a receipt with an order number (job ID). You sit down and wait. In the kitchen, cooks (workers) pick up tickets and prepare food. When your order is ready, your number is called. The cashier (API) is free to serve the next customer immediately. The kitchen (workers) processes orders at its own pace. If the kitchen is backed up, orders queue up — but no customer is blocked at the counter.
🔥 Key Insight
Task queuing decouples submission from execution. The API accepts the request in milliseconds and returns a job ID. The actual work happens asynchronously in a worker process. This keeps the API fast, prevents timeouts, and lets you scale workers independently based on workload.
Task Queuing Overview
```
Client → API → Queue → Worker(s) → Result Store

1. Client: "Process this video" (POST /uploads)
2. API: creates task, enqueues it, returns job_id
   Response: { "job_id": "job-abc-123", "status": "queued" }
   → Client gets response in ~50ms
3. Queue: holds the task until a worker picks it up
   Tasks: [job-abc-123, job-def-456, job-ghi-789, ...]
4. Worker: dequeues task, processes it (3 minutes)
   → Transcode video, generate thumbnails, run moderation
5. Result Store: worker writes result
   → Update DB: job-abc-123 status = "completed", output_url = "..."
6. Client polls or gets notified:
   GET /jobs/job-abc-123 → { "status": "completed", "url": "..." }
   OR: WebSocket/push notification when done
```
Async Processing
The API doesn't wait for the task to finish. It returns immediately with a job ID. The client checks back later or gets a notification. No timeouts, no blocking.
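A minimal sketch of this enqueue-and-return pattern using Flask and a Redis list as the queue (the endpoint, key names, and payload fields are illustrative, not a prescribed API):

```python
import json
import uuid

import redis
from flask import Flask, jsonify

app = Flask(__name__)
r = redis.Redis()

@app.post("/uploads")
def create_upload():
    # Create the task and enqueue it -- no heavy work happens here.
    job_id = f"job-{uuid.uuid4().hex[:8]}"
    r.lpush("queue:video", json.dumps({"job_id": job_id, "type": "process_video"}))
    r.hset(f"job:{job_id}", mapping={"status": "queued"})
    # Return in milliseconds; the client polls GET /jobs/<job_id> later.
    return jsonify({"job_id": job_id, "status": "queued"}), 202
```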
Load Smoothing
100 video uploads arrive in 1 second. 5 workers process 1 video/minute. The queue holds 100 tasks; workers drain them over 20 minutes. No spike overwhelms the system.
Job Queues vs Message Queues
Job Queues — "Do This Work"
A job queue is designed for task execution. You submit a job (a unit of work), a worker picks it up, executes it, and reports the result. The queue tracks job state: queued → running → completed/failed.
```
Job:
{
  id: "job-abc-123",
  type: "transcode_video",
  payload: { video_id: "vid-42", format: "720p" },
  status: "queued",                       ← tracked by the queue
  retries: 0,                             ← auto-retry on failure
  max_retries: 3,
  scheduled_at: "2024-06-15T10:00:00Z",   ← delayed execution
  created_at: "2024-06-15T09:55:00Z"
}

Lifecycle:
queued → running → completed ✅
       → failed → retry → running → completed ✅
       → failed (max retries) → dead ❌ (→ DLQ)

Features:
✅ Status tracking (query job state anytime)
✅ Retry with backoff (automatic on failure)
✅ Scheduling (run at a specific time)
✅ Priority (urgent jobs first)
✅ Timeout (kill jobs that run too long)

Examples: Sidekiq (Ruby), Celery (Python), BullMQ (Node.js), Temporal
```
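In Celery, one of the job-queue libraries named above, most of these features are declared on the task itself. A sketch, with the transcode body elided:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")  # backend stores job status

class TransientError(Exception):
    """Raised by the (hypothetical) transcoder for retryable failures."""

@app.task(bind=True, max_retries=3)
def transcode_video(self, video_id, fmt):
    try:
        ...  # download, transcode, upload -- the actual work goes here
    except TransientError as exc:
        # Backoff: 10s, 20s, 40s on successive failures; once max_retries
        # is exceeded, Celery marks the task as failed.
        raise self.retry(exc=exc, countdown=10 * 2 ** self.request.retries)

# Enqueue: immediately, or delayed (here by 5 minutes) via countdown.
result = transcode_video.apply_async(args=["vid-42", "720p"], countdown=300)
print(result.id, result.status)  # PENDING -> STARTED -> SUCCESS / FAILURE
```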
Message Queues — "Something Happened"
A message queue is designed for communication between services. A producer publishes a message (an event or command), one or more consumers receive it. The queue doesn't track what happens after delivery — it just ensures the message is delivered.
```
Message:
{
  topic: "order-events",
  key: "user-42",
  value: { "event": "OrderPlaced", "order_id": "ord-123", "total": 99.99 },
  timestamp: "2024-06-15T10:00:00Z"
}

Lifecycle:
published → delivered → acknowledged
(no status tracking, no retry management by the queue itself)

Features:
✅ Pub/Sub (one message → many consumers)
✅ Ordering (within a partition)
✅ Replay (consumers can re-read old messages)
✅ High throughput (millions of messages/sec)
❌ No job status tracking
❌ No built-in retry/scheduling (consumer handles this)

Examples: Kafka, SQS, RabbitMQ, Redis Streams
```
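A minimal fan-out sketch using Redis pub/sub (channel and payload are illustrative; unlike Kafka, plain pub/sub keeps no history, so the delivery-only semantics are especially visible here):

```python
import json

import redis

r = redis.Redis()

# Producer side: announce the event and move on -- fire and forget.
event = {"event": "OrderPlaced", "order_id": "ord-123", "total": 99.99}
r.publish("order-events", json.dumps(event))

# Consumer side (each interested service -- payment, inventory, ... --
# runs its own copy of this loop and receives its own copy of the message):
sub = r.pubsub()
sub.subscribe("order-events")
for message in sub.listen():
    if message["type"] == "message":
        print("received:", json.loads(message["data"]))
```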
| Dimension | Job Queue | Message Queue |
|---|---|---|
| Purpose | Execute a task | Deliver a message/event |
| Semantics | 'Do this work' | 'Something happened' |
| Status tracking | Yes (queued, running, done, failed) | No (delivered or not) |
| Retry logic | Built-in (configurable) | Consumer must implement |
| Scheduling | Yes (run at specific time) | No (immediate delivery) |
| Fan-out | One worker per job | Multiple consumers per message |
| Throughput | Moderate (task overhead) | Very high (lightweight messages) |
| Examples | Sidekiq, Celery, BullMQ | Kafka, SQS, RabbitMQ |
| Use case | Video processing, email, reports | Event streaming, service communication |
🎯 Interview Insight
Job queue = "do work." Message queue = "notify others." Use a job queue when you need to execute a task and track its progress (video transcoding, report generation). Use a message queue when you need to broadcast an event to multiple services (OrderPlaced → payment, inventory, notification). Many systems use both.
Priority Queues
Not all tasks are equal. A password reset email should be sent in seconds. A weekly analytics report can wait hours. Priority queues ensure urgent tasks are processed before less important ones.
```
Three priority levels:

HIGH priority queue:   [password-reset, payment-alert, security-warning]
MEDIUM priority queue: [order-confirmation, shipping-update]
LOW priority queue:    [weekly-report, data-export, cleanup-job]

Worker behavior:
1. Check HIGH queue → any tasks? → process
2. If HIGH is empty → check MEDIUM → process
3. If MEDIUM is empty → check LOW → process

Result:
Password reset email: sent in 2 seconds ✅
Order confirmation:   sent in 30 seconds ✅
Weekly report:        generated in 2 hours ✅

Implementation options:

Option A: Separate queues per priority (simplest)
  Workers poll HIGH first, then MEDIUM, then LOW

Option B: Single queue with priority field (heap-based)
  Queue sorts by priority internally
  Higher priority tasks dequeued first

Option C: Weighted fair queuing
  HIGH gets 60% of worker capacity
  MEDIUM gets 30%
  LOW gets 10%
  → Prevents starvation of low-priority tasks
```
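Option A falls out of Redis almost for free: BRPOP accepts several keys and pops from the first non-empty list, so listing the queues in priority order yields strict-priority dequeueing. A sketch with illustrative queue names:

```python
import json

import redis

r = redis.Redis()
QUEUES = ["queue:high", "queue:medium", "queue:low"]  # checked in this order

def worker_loop():
    while True:
        # BRPOP scans its keys left to right and blocks until one has a
        # task, so HIGH always drains before MEDIUM, and MEDIUM before LOW.
        queue, raw = r.brpop(QUEUES)
        task = json.loads(raw)
        print(f"processing {task} from {queue.decode()}")  # real work here
```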
Benefits
- ✅ Critical tasks processed first (password resets, alerts)
- ✅ Better user experience (urgent actions feel instant)
- ✅ Resource allocation matches business importance
- ✅ Simple to implement with separate queues
Challenges
- ❌ Starvation: low-priority tasks never run if high-priority is always full
- ❌ Priority inversion: a low-priority task holds a resource a high-priority task needs
- ❌ Complexity: managing multiple queues and worker allocation
- ❌ Fairness: must ensure all priorities eventually get processed
🎯 Interview Insight
When designing a system with mixed task urgency, always mention priority queues. Say: "I'd use separate queues for high, medium, and low priority. Workers check high first. To prevent starvation, I'd use weighted fair queuing — 60% capacity for high, 30% medium, 10% low. This ensures urgent tasks are fast without starving batch jobs."
Dead-Letter Queues (DLQ)
A dead-letter queue is where failed messages go to die — or rather, to wait for investigation. When a task fails repeatedly (after all retries are exhausted), it's moved to the DLQ instead of being retried forever or silently dropped.
```
Task: "Send welcome email to user-42"

Attempt 1: Email service is down     → FAILED → retry in 10s
Attempt 2: Email service still down  → FAILED → retry in 30s
Attempt 3: Email service returns 500 → FAILED → retry in 60s
Attempt 4 (max retries reached): FAILED → move to DLQ

Main Queue:  [task-1, task-2, task-3, ...]   ← healthy tasks
Dead-Letter: [failed-task-A, failed-task-B]  ← failed tasks

DLQ contains:
{
  original_task: { type: "send_email", user_id: 42 },
  failure_reason: "Email service returned 500",
  attempts: 4,
  first_failed_at: "2024-06-15T10:00:00Z",
  last_failed_at: "2024-06-15T10:01:30Z"
}

What happens next:
1. Alert: monitoring detects DLQ depth > 0 → page on-call engineer
2. Investigate: engineer checks failure reason → email service was down
3. Fix: email service is restored
4. Replay: move tasks from DLQ back to main queue → processed successfully
```
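A sketch of the retry-then-DLQ decision inside a worker, using Redis lists and the same fields as the DLQ entry above (the actual work is elided):

```python
import json
import time

import redis

r = redis.Redis()
MAX_RETRIES = 3

def handle(raw_task: bytes) -> None:
    task = json.loads(raw_task)
    try:
        ...  # the actual work -- e.g. call the email service
    except Exception as exc:
        task["attempts"] = task.get("attempts", 0) + 1
        if task["attempts"] > MAX_RETRIES:
            # Exhausted: keep the task plus its error context for humans,
            # instead of dropping it or retrying forever.
            task["failure_reason"] = str(exc)
            task["last_failed_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
            r.lpush("queue:email:dlq", json.dumps(task))
        else:
            # Re-queue for another attempt (a real system would delay this
            # with backoff rather than retrying immediately -- see Pitfalls).
            r.lpush("queue:email", json.dumps(task))
```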
Fault Isolation
Poison messages (malformed, invalid) don't block the queue. They're moved to DLQ after max retries. Healthy messages continue processing.
Debugging
DLQ preserves the failed message with its error context. Engineers can inspect why it failed, fix the issue, and replay the message.
No Silent Data Loss
Without DLQ: failed messages are dropped after max retries. With DLQ: they're preserved for investigation and replay. No data is silently lost.
When to use DLQ
- ✅ Any production queue system (it's a best practice, not optional)
- ✅ Payment processing (failed charges must be investigated)
- ✅ Event processing (failed events can't be silently dropped)
- ✅ Email/notification sending (failed sends need retry after fix)
DLQ operations
- ✅ Monitor: alert when DLQ depth > 0
- ✅ Inspect: view failed messages and error reasons
- ✅ Replay: move messages back to the main queue after fixing the issue (see the sketch below)
- ✅ Purge: delete messages that are no longer relevant
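The monitor and replay operations reduce to a few queue commands. A Redis sketch (key names illustrative; LMOVE requires Redis 6.2+):

```python
import redis

r = redis.Redis()

# Monitor: any DLQ depth above zero is worth an alert.
depth = r.llen("queue:email:dlq")
if depth > 0:
    print(f"ALERT: {depth} tasks in the DLQ")  # hook up a real pager here

# Replay: once the root cause is fixed, drain the DLQ back into the
# main queue, one atomic move at a time.
while r.lmove("queue:email:dlq", "queue:email", "LEFT", "RIGHT"):
    pass
```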
🎯 Interview Insight
Always mention DLQ when discussing queues. It shows you think about failure handling. Say: "After 3 retries with exponential backoff, failed tasks move to a dead-letter queue. We monitor DLQ depth and alert on-call. Engineers investigate, fix the root cause, and replay the failed tasks. No message is silently lost."
End-to-End Scenario
Let's design the background job system for a video platform — where every upload triggers multiple long-running tasks.
🎬 Video Platform — 50K Uploads/Day
Each upload triggers: transcoding (3 formats), thumbnail generation, content moderation, notification.
Transcoding takes 2-10 minutes. Thumbnails take 30 seconds. Moderation takes 1 minute.
Upload API → Job Queue (immediate response)
User uploads video. API stores the file in S3, creates a 'process_video' job in the job queue, and returns immediately: { job_id: 'job-123', status: 'queued' }. User sees 'Processing your video...' The API response takes 200ms, not 10 minutes.
Priority queues → Urgent vs batch
Paid users' videos go to the HIGH priority queue (processed in minutes). Free users go to LOW priority (processed in hours during off-peak). Content moderation flagged as CRITICAL goes to a dedicated queue with its own workers — never delayed by transcoding backlog.
Workers process tasks
Transcoding workers (10 instances) pick up jobs from the queue. Each worker: dequeue → download video from S3 → transcode to 720p/1080p/4K → upload results to S3 → update job status to 'completed.' If a worker crashes mid-transcode, the job times out and is re-queued automatically.
Failures → Retry → DLQ
Transcoding fails (corrupt video file). Retry 1: same error. Retry 2: same error. Retry 3: same error. Job moves to DLQ. Alert fires. Engineer inspects: 'corrupt file, can't transcode.' Notifies user: 'Your video could not be processed. Please re-upload.' DLQ entry is marked as resolved.
Completion → Notify user
All tasks complete. Job status: 'completed.' Push notification sent to user: 'Your video is ready!' Or: client polls GET /jobs/job-123 and sees status change. Video page now shows all formats and thumbnails.
```
Upload API
│
├── Store video in S3
├── Create job in queue
└── Return job_id to client (200ms)

Job Queue (BullMQ / Celery):
  HIGH:     [paid-user-videos]
  LOW:      [free-user-videos]
  CRITICAL: [content-moderation]

Workers:
  Transcode workers (10x):  HIGH → LOW queue
  Thumbnail workers (5x):   all priorities
  Moderation workers (3x):  CRITICAL queue only

Failure handling:
  Retry: 3 attempts, exponential backoff (10s, 30s, 90s)
  DLQ:   after 3 failures → dead-letter queue
  Alert: DLQ depth > 0 → page on-call

Status tracking:
  GET /jobs/job-123 → { "status": "running", "progress": 65 }
  → Client shows progress bar
  → On completion: push notification + status update
```
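If the job queue above were Celery, the HIGH/LOW/CRITICAL split is mostly routing configuration plus pinning workers to queues. A minimal sketch, with illustrative task and queue names:

```python
from celery import Celery

app = Celery("video", broker="redis://localhost:6379/0")

# Moderation always lands on its own queue, regardless of caller:
app.conf.task_routes = {"video.moderate_content": {"queue": "critical"}}

@app.task(name="video.transcode")
def transcode(video_id: str, fmt: str) -> None:
    ...  # download from S3, transcode, upload results

# Tier is decided at enqueue time by overriding the target queue:
is_paid_user = True  # illustrative flag
transcode.apply_async(args=["vid-42", "1080p"],
                      queue="high" if is_paid_user else "low")

# Workers are pinned to queues on the command line, e.g.:
#   celery -A video worker -Q high,low -c 10   # transcode workers
#   celery -A video worker -Q critical -c 3    # moderation workers
```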
Trade-offs & Decision Making
| Dimension | Job Queue | Message Queue |
|---|---|---|
| Use when | You need to execute work and track it | You need to notify services about events |
| Status tracking | Built-in (queued/running/done/failed) | None (consumer tracks its own state) |
| Retry | Built-in with backoff | Consumer implements retry |
| Fan-out | One worker per job | Multiple consumers per message |
| Throughput | Moderate (task overhead) | Very high (lightweight) |
| Best for | Video processing, emails, reports | Event streaming, service communication |
Priority vs Fairness
| Strategy | Urgent Tasks | Batch Tasks | Starvation Risk | Complexity |
|---|---|---|---|---|
| Single queue (FIFO) | Wait in line | Wait in line | None | Lowest |
| Separate priority queues | Processed first | Processed last | High (low never runs) | Low |
| Weighted fair queuing | 60% capacity | 10% capacity | Low (guaranteed minimum) | Medium |
| Dedicated workers per priority | Own workers | Own workers | None (independent) | Higher |
🎯 Decision Framework
Start with a single FIFO queue. When you need urgency differentiation, add priority queues (separate queues, workers check high first). When low-priority starvation becomes a problem, switch to weighted fair queuing. Always have a DLQ — it's not optional, it's a reliability requirement.
Interview Questions
Q: Job queue vs message queue — what's the difference?
A: A job queue is for task execution: submit a job, a worker executes it, the queue tracks status (queued/running/done/failed), retries on failure, supports scheduling and priority. Think: 'do this work.' A message queue is for communication: publish an event, one or more consumers receive it, the queue ensures delivery but doesn't track what happens after. Think: 'something happened.' Use job queues for background tasks (video processing, emails). Use message queues for event-driven communication between services.
Q: Why use priority queues?
A: When tasks have different urgency levels. A password reset email must be sent in seconds — it can't wait behind 10,000 batch report jobs. Priority queues ensure critical tasks are processed first. Implementation: separate queues per priority level, workers check high-priority first. To prevent starvation of low-priority tasks, use weighted fair queuing — allocate a guaranteed percentage of worker capacity to each priority level.
Q: What is a dead-letter queue?
A: A DLQ stores messages/jobs that failed after all retry attempts. Instead of dropping them (data loss) or retrying forever (infinite loop), they're moved to a separate queue for investigation. Benefits: (1) fault isolation — poison messages don't block the main queue, (2) debugging — failed messages are preserved with error context, (3) replay — after fixing the issue, move messages back to the main queue. Every production queue system should have a DLQ. Monitor its depth and alert when it's non-empty.
Your email-sending service processes 100K emails/day, but some fail
How do you handle failures reliably?
Answer: Use a job queue with retry and DLQ. On failure: retry 3 times with exponential backoff (10s, 30s, 90s). If still failing after 3 attempts, move to DLQ. Monitor DLQ depth — alert if > 0. Common failure causes: invalid email address (permanent, don't retry), email service down (temporary, retry works), rate limited (temporary, retry with longer backoff). For permanent failures, mark as failed and notify the user. For temporary failures, the retry mechanism handles it automatically.
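A sketch of that decision logic; the error codes and the helper functions (mark_failed_and_notify, move_to_dlq, schedule_retry) are hypothetical stand-ins:

```python
PERMANENT_ERRORS = {"invalid_address", "recipient_blocked"}  # never retry

def handle_send_failure(task: dict, error_code: str, attempts: int,
                        max_retries: int = 3) -> None:
    if error_code in PERMANENT_ERRORS:
        mark_failed_and_notify(task)               # hypothetical helper
    elif attempts >= max_retries:
        move_to_dlq(task, reason=error_code)       # hypothetical helper
    else:
        delay = 10 * 3 ** attempts                 # 10s, 30s, 90s
        schedule_retry(task, delay_seconds=delay)  # hypothetical helper
```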
Paid users complain their video processing takes too long
How do you prioritize paid users?
Answer: Separate priority queues. Paid users → HIGH queue, free users → LOW queue. Transcoding workers check HIGH first — paid videos are processed within minutes. Free users are processed during off-peak hours or when HIGH is empty. To prevent free users from never being processed: use weighted fair queuing — 70% capacity for HIGH, 30% for LOW. This guarantees paid users are fast while ensuring free users still get processed within a reasonable time.
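Weighted fair queuing can be as simple as a weighted random pick of which queue to poll next. A sketch of the 70/30 split, assuming Redis lists with illustrative names:

```python
import random

import redis

r = redis.Redis()
WEIGHTS = {"queue:high": 70, "queue:low": 30}  # % of dequeue attempts

def next_task():
    # Pick a queue in proportion to its weight; fall back to the other
    # if it is empty. LOW gets ~30% of pulls, so it never starves.
    queues = list(WEIGHTS)
    first = random.choices(queues, weights=list(WEIGHTS.values()))[0]
    other = queues[1 - queues.index(first)]
    return r.rpop(first) or r.rpop(other)
```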
Pitfalls
Not handling retries properly
Retrying immediately on failure, or retrying with the same interval. The downstream service is overloaded — hammering it with retries makes it worse. Or: no retry at all — a transient network error causes permanent task failure.
✅ Use exponential backoff with jitter: retry after 1s, 2s, 4s, 8s (+ random jitter to prevent thundering herd). Set a max retry count (3-5). After max retries, move to DLQ. Distinguish between retryable errors (timeout, 503) and permanent errors (400, invalid input) — don't retry permanent errors.
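A sketch of capped exponential backoff with full jitter (the base, cap, and jitter strategy are choices, not fixed rules):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry `attempt` (0-based): 1s, 2s, 4s, 8s... capped,
    then randomized so simultaneous failures don't retry in lockstep."""
    exponential = min(cap, base * 2 ** attempt)
    return random.uniform(0, exponential)  # "full jitter"
```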
Infinite retry loops
A malformed message causes the worker to crash every time it's processed. Without a max retry limit, the message is retried forever — consuming worker capacity and never succeeding. The queue appears healthy (messages are being processed) but nothing is actually completing.
✅ Always set max_retries (3-5 is typical). After max retries, move to DLQ. Log the failure reason with each attempt. Monitor: if the same job has been retried 3 times, it's likely a permanent issue — stop retrying and investigate.
Ignoring DLQ
Setting up a DLQ but never monitoring it. Failed messages accumulate silently. Customers never receive their emails, videos are never processed, payments are never completed — and nobody knows until customers complain.
✅ Monitor DLQ depth as a critical metric. Alert immediately when DLQ depth > 0. Build a DLQ dashboard: show failed messages, error reasons, timestamps. Build a replay mechanism: one-click to move messages back to the main queue after fixing the issue. Review DLQ daily as part of operations.
Overloading workers
Running 100 concurrent tasks on a worker with 4 CPU cores. Each task uses 100% of a core for transcoding. Workers thrash, tasks take 10x longer, memory runs out, workers crash — creating more retries and more load.
✅ Match worker concurrency to available resources. CPU-bound tasks (transcoding): 1 task per core. I/O-bound tasks (API calls, emails): 10-50 concurrent tasks per worker. Monitor worker CPU, memory, and task duration. Auto-scale workers based on queue depth — add workers when the queue grows, remove when it shrinks.
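As a sketch, that sizing rule can be encoded directly (the multipliers are rules of thumb from the text, not universal constants):

```python
import os

def worker_concurrency(task_kind: str) -> int:
    cores = os.cpu_count() or 1
    if task_kind == "cpu_bound":   # e.g. transcoding: one task per core
        return cores
    if task_kind == "io_bound":    # e.g. API calls, email: tasks mostly wait
        return cores * 10          # lands in the 10-50-per-worker range
    return cores                   # default to the conservative choice
```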