
Scheduling

Orchestrate long-running tasks — cron-style job scheduling, distributed task coordination with locks and leader election, and progress tracking with status APIs.

01

The Big Picture — Why Tasks Need Scheduling

Not every operation should happen immediately. Reports need to generate at midnight. Stale data needs cleanup every hour. Video transcoding takes minutes and can't block the upload response. These tasks need to run at the right time, exactly once, and with visibility into their progress.

Alarm Clock + Team + Progress Board

Think of a factory with three systems working together. The alarm clock (cron scheduler) decides WHEN tasks run — 'Generate the daily report at 6 AM.' The team coordinator (distributed coordination) ensures only ONE person does the job — if 5 workers hear the alarm, only one actually runs the report. The progress board (status API) shows everyone the current state — 'Report: 60% complete, processing sales data.' Without the alarm, tasks don't run on time. Without coordination, the same report generates 5 times. Without the board, nobody knows if it's done or stuck.

Timing

Tasks must run at specific times or intervals. A daily report at midnight, a cleanup job every hour, a retry after 30 minutes. Without scheduling, someone has to manually trigger every job.

🤝

Coordination

In a distributed system with 10 instances, a cron job fires on ALL 10. Without coordination, the same job runs 10 times — 10 duplicate reports, 10 duplicate emails.

📊

Visibility

A video transcoding job takes 15 minutes. Without progress tracking, the user stares at a spinner with no idea if it's 10% done or 90% done — or if it failed silently.

🔥 Key Insight

Scheduling is three problems in one: when to run (timing), who runs it (coordination), and what's happening (visibility). Solving only one creates gaps — a perfectly timed job that runs 10 times, or a coordinated job that nobody can monitor.

02

Scheduling Architecture

Scheduler

Triggers at time

🔒

Coordinator

Ensures single exec

📬

Queue

Buffers work

⚙️

Worker

Executes task

📊

Status Store

Tracks progress

Scheduling System — Component Roles
SCHEDULER (timing)
Evaluates cron expressions
Triggers jobs at the right time
Pushes job messages to the queue
Does NOT execute the job itself

COORDINATION LAYER (single execution)
Distributed lock (Redis SETNX / ZooKeeper)
Leader election (only leader triggers jobs)
Lease-based: lock expires if holder crashes
Prevents duplicate execution across instances

QUEUE (decoupling)
Buffers jobs between scheduler and workers
Handles backpressure (workers busy → jobs wait)
Provides at-least-once delivery with ack

WORKER (execution)
Picks up jobs from queue
Executes the actual task
Updates progress in status store
Acks the job on completion (or nacks on failure)

STATUS STORE (visibility)
Stores: job_id, status, progress %, logs, errors
Status API: GET /api/jobs/{id} → { status, progress, ... }
Enables UI progress bars, admin dashboards, alerting
03

Cron-Style Job Scheduling

Cron scheduling runs jobs at fixed times or intervals defined by a cron expression. It's the simplest and most widely used scheduling pattern — every operating system, every cloud provider, and most frameworks support it.

Cron Expressions — The Syntax
Format: minute hour day-of-month month day-of-week

Examples:
  "0 * * * *"      → Every hour (at minute 0)
  "0 0 * * *"      → Every day at midnight
  "0 6 * * 1"      → Every Monday at 6:00 AM
  "*/5 * * * *"    → Every 5 minutes
  "0 0 1 * *"      → First day of every month at midnight
  "0 9-17 * * 1-5" → Every hour from 9 AM to 5 PM, weekdays only

How it works internally:
  1. Scheduler evaluates all registered cron expressions every minute
  2. For each expression that matches the current time:
       - Create a job message
       - Push to the job queue
  3. Worker picks up the job and executes it

  Scheduler does NOT execute jobs; it only triggers them.
  This separation allows scaling workers independently.
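The matching step above can be sketched in Python. This is a simplified, illustrative matcher (the `_field_matches` and `cron_matches` names are ours, not from any real library), and note one simplification: it ANDs day-of-month with day-of-week, whereas classic cron ORs them when both fields are restricted.

```python
from datetime import datetime

def _field_matches(field: str, value: int) -> bool:
    """Check one cron field ('*', '*/5', '9-17', '0', '1,15') against a value."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):              # step: */5 matches 0, 5, 10, ...
            if value % int(part[2:]) == 0:
                return True
        elif "-" in part:                      # range: 9-17
            lo, hi = map(int, part.split("-"))
            if lo <= value <= hi:
                return True
        elif int(part) == value:               # exact value
            return True
    return False

def cron_matches(expr: str, now: datetime) -> bool:
    """True if 'minute hour day-of-month month day-of-week' matches `now`."""
    minute, hour, dom, month, dow = expr.split()
    return (_field_matches(minute, now.minute)
            and _field_matches(hour, now.hour)
            and _field_matches(dom, now.day)
            and _field_matches(month, now.month)
            and _field_matches(dow, now.isoweekday() % 7))  # cron: 0 = Sunday
```

A scheduler loop would call `cron_matches(expr, datetime.now())` once per minute for every registered job and enqueue a message for each match.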

Real-World Use Cases

📊

Daily Reports

Generate sales reports at midnight. Aggregate yesterday's data, build PDF, email to stakeholders. Runs once per day, takes 5-30 minutes.

🧹

Data Cleanup

Delete expired sessions every hour. Purge soft-deleted records after 30 days. Archive old logs. Keeps the database lean.

💓

Health Checks

Ping all downstream services every minute. If a service is down, trigger an alert. Continuous monitoring without manual intervention.

Strengths

  • Simple and universally understood
  • Predictable — runs at exact times
  • No external dependencies (built into OS/framework)
  • Easy to audit (cron expression = schedule)
  • Decades of battle-tested reliability

Limitations

  • Not dynamic — can't schedule 'run in 30 minutes from now'
  • Time drift — clock skew between servers causes inconsistency
  • No built-in coordination — fires on every instance
  • No backpressure — triggers even if previous run isn't done
  • Minimum granularity is typically 1 minute

🎯 Interview Insight

Cron is the foundation, but never use it alone in a distributed system. Say: "I'd use cron to define the schedule, but wrap it with a distributed lock so only one instance triggers the job. The job goes to a queue, and workers execute it. This gives me timing (cron) + coordination (lock) + reliability (queue)."

04

Distributed Task Coordination

In a distributed system with N instances, a cron job fires on all N simultaneously. Without coordination, the same job executes N times. Distributed coordination ensures exactly-once triggering.

The Duplicate Execution Problem
5 API server instances, each running the same cron:
  "0 0 * * *" → Generate daily report

At midnight:
  Instance 1: cron fires → generate report ← ✅
  Instance 2: cron fires → generate report ← ❌ duplicate
  Instance 3: cron fires → generate report ← ❌ duplicate
  Instance 4: cron fires → generate report ← ❌ duplicate
  Instance 5: cron fires → generate report ← ❌ duplicate

Result: 5 identical reports generated, 5 emails sent.
Users get 5 copies. Database does 5x the work.

Coordination Techniques

1

Distributed Lock (Redis SETNX)

Before executing, the instance tries to acquire a lock: SET job:daily_report:2025-01-15 NX EX 300. Only one instance succeeds (NX = set if not exists). The winner executes the job. Others see the lock exists and skip. The lock expires after 5 minutes (EX 300) in case the winner crashes.

2

Leader Election

One instance is elected as the 'leader' (via ZooKeeper, etcd, or a database row). Only the leader runs the scheduler. Other instances are standby. If the leader crashes, a new leader is elected. This is simpler than per-job locking but has a single point of scheduling.

3

Lease-Based Execution

A job is 'leased' to a worker for a fixed duration (e.g., 10 minutes). If the worker completes within the lease, it marks the job done. If it crashes, the lease expires and another worker can pick it up. Combines lock + timeout + retry.
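Lease semantics can be sketched with an in-memory store standing in for the shared system (a Redis key or a database row); the `LeaseStore` class and its method names are illustrative, not from any real library.

```python
import time

class LeaseStore:
    """In-memory stand-in for a shared lease table (e.g. Redis or a DB row)."""

    def __init__(self):
        self._leases = {}  # job_id -> (worker_id, expires_at)

    def acquire(self, job_id: str, worker_id: str, ttl: float) -> bool:
        """Grant the lease if it is free or its previous holder let it expire."""
        now = time.monotonic()
        holder = self._leases.get(job_id)
        if holder and holder[1] > now:
            return False                      # another worker holds a live lease
        self._leases[job_id] = (worker_id, now + ttl)
        return True

    def renew(self, job_id: str, worker_id: str, ttl: float) -> bool:
        """Heartbeat: extend the lease, but only for the current holder."""
        holder = self._leases.get(job_id)
        if not holder or holder[0] != worker_id:
            return False
        self._leases[job_id] = (worker_id, time.monotonic() + ttl)
        return True
```

A healthy worker calls `renew` on a heartbeat interval well below the TTL; if it crashes, the heartbeats stop, the lease expires, and `acquire` succeeds for the next worker.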

Distributed Lock — Redis Implementation
At midnight, all 5 instances try:

  Instance 1: SET "lock:daily_report:2025-01-15" "instance-1" NX EX 300
Response: OK (lock acquired ✅)
Execute job → push to queue → release lock

  Instance 2: SET "lock:daily_report:2025-01-15" "instance-2" NX EX 300
Response: nil (lock exists, someone else has it)
Skip execution

  Instance 3-5: same as Instance 2 → skip

Key design:
Lock key includes the date: prevents re-running tomorrow's job today
NX: atomic set-if-not-exists (no race condition)
EX 300: lock expires in 5 minutes (crash recovery)
Value = instance ID (for debugging: who holds the lock?)

If Instance 1 crashes mid-execution:
Lock expires after 300 seconds
Next cron tick: another instance acquires the lock
Job retries (must be idempotent!)
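The SET ... NX EX flow above can be sketched in Python. Rather than assume a live Redis server, a dict with expiry timestamps emulates the command's semantics here; `set_nx_ex` and `should_run` are illustrative names (with the real redis-py client the call would be roughly `redis.set(key, instance_id, nx=True, ex=300)`).

```python
import time
from datetime import date

_store = {}  # key -> (value, expires_at); emulates Redis SET ... NX EX

def set_nx_ex(key: str, value: str, ttl: float) -> bool:
    """Atomic set-if-not-exists with expiry, like Redis SET key value NX EX ttl."""
    now = time.monotonic()
    entry = _store.get(key)
    if entry and entry[1] > now:
        return False              # key exists and is live -> Redis returns nil
    _store[key] = (value, now + ttl)
    return True                   # Redis returns OK: lock acquired

def should_run(job_name: str, instance_id: str, ttl: float = 300) -> bool:
    """Date-scoped lock key: today's job runs once; tomorrow's is a new key."""
    key = f"lock:{job_name}:{date.today().isoformat()}"
    return set_nx_ex(key, instance_id, ttl)
```

Each instance calls `should_run` when its cron fires; only the first caller gets `True`, and the TTL provides crash recovery exactly as in the text illustration.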
Technique: Distributed Lock
  How it works: SETNX per job execution
  Pros: simple, per-job granularity
  Cons: lock expiry tuning, Redis dependency
  Best for: most cron jobs, simple coordination

Technique: Leader Election
  How it works: one instance is the scheduler
  Pros: simple logic, no per-job locks
  Cons: single point of scheduling, failover delay
  Best for: small clusters, few scheduled jobs

Technique: Lease-Based
  How it works: job leased with timeout
  Pros: handles crashes gracefully, auto-retry
  Cons: more complex, needs idempotent jobs
  Best for: long-running jobs, unreliable workers

🎯 Interview Insight

Distributed lock with Redis is the standard answer. Say: "Each instance tries to acquire a lock with SETNX before executing the cron job. Only one succeeds. The lock includes the job name and date to prevent re-execution. It has a TTL for crash recovery. The job must be idempotent in case of lock expiry and re-execution."

05

Progress Tracking & Status APIs

When a job takes minutes or hours, users and operators need visibility. A status API exposes the current state of every job — pending, running, progress percentage, completion, or failure with error details.

Job Status — State Machine
States:
  PENDING:   Job created, waiting in queue
  RUNNING:   Worker picked it up, executing
  COMPLETED: Finished successfully
  FAILED:    Failed (with error details)
  RETRYING:  Failed, scheduled for retry

Transitions:
  PENDING  → RUNNING    (worker picks up job)
  RUNNING  → COMPLETED  (success)
  RUNNING  → FAILED     (error, max retries exceeded)
  RUNNING  → RETRYING   (error, will retry)
  RETRYING → RUNNING    (retry attempt starts)

Status API:
  GET /api/jobs/job_abc123

  Response:
  {
    "id": "job_abc123",
    "type": "video_transcode",
    "status": "RUNNING",
    "progress": 65,
    "created_at": "2025-01-15T10:00:00Z",
    "started_at": "2025-01-15T10:00:05Z",
    "updated_at": "2025-01-15T10:03:22Z",
    "metadata": {
      "input": "video_456.mp4",
      "current_step": "Transcoding 720p variant",
      "steps_completed": 2,
      "steps_total": 4
    },
    "attempts": 1,
    "max_attempts": 3
  }
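The transition rules above can be encoded as a small lookup table so workers reject illegal jumps; this is a sketch, with `TRANSITIONS` and `transition` as illustrative names.

```python
# Allowed transitions from the state machine above; anything else is rejected.
TRANSITIONS = {
    "PENDING":   {"RUNNING"},
    "RUNNING":   {"COMPLETED", "FAILED", "RETRYING"},
    "RETRYING":  {"RUNNING"},
    "COMPLETED": set(),   # terminal state
    "FAILED":    set(),   # terminal state
}

def transition(current: str, target: str) -> str:
    """Move a job to `target`, refusing illegal jumps (e.g. PENDING -> COMPLETED)."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Centralizing the rules this way means a buggy worker cannot, for example, mark a job COMPLETED that was never picked up.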

How Workers Report Progress

Worker Progress Updates — Pseudocode
function process_video_transcode(job):
  update_status(job.id, "RUNNING", progress=0)

  // Step 1: Download original
  update_status(job.id, "RUNNING", progress=10, step="Downloading original")
  download(job.input_url)

  // Step 2: Transcode 1080p
  update_status(job.id, "RUNNING", progress=30, step="Transcoding 1080p")
  transcode(input, "1080p")

  // Step 3: Transcode 720p
  update_status(job.id, "RUNNING", progress=55, step="Transcoding 720p")
  transcode(input, "720p")

  // Step 4: Transcode 480p + thumbnail
  update_status(job.id, "RUNNING", progress=80, step="Transcoding 480p")
  transcode(input, "480p")
  generate_thumbnail(input)

  // Step 5: Upload variants
  update_status(job.id, "RUNNING", progress=95, step="Uploading variants")
  upload_all_variants()

  update_status(job.id, "COMPLETED", progress=100)

Where status is stored:
Redis (fast writes, good for real-time progress)
Database (durable, good for audit trail)
Both: Redis for live progress, DB for permanent record
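The pseudocode above can be made concrete. Here an in-memory dict stands in for Redis, and `run_job` takes a list of (step name, callable) pairs; this is a sketch of the reporting pattern, not a production worker.

```python
import time

status_store = {}  # job_id -> status record; stands in for Redis

def update_status(job_id, status, progress=None, step=None):
    """Write the latest state so the status API can serve it."""
    record = status_store.setdefault(job_id, {"id": job_id})
    record["status"] = status
    if progress is not None:
        record["progress"] = progress
    if step is not None:
        record["current_step"] = step
    record["updated_at"] = time.time()

def run_job(job_id, steps):
    """Execute named steps in order, reporting progress before each one."""
    update_status(job_id, "RUNNING", progress=0)
    for i, (name, work) in enumerate(steps, start=1):
        update_status(job_id, "RUNNING",
                      progress=int(100 * (i - 1) / len(steps)), step=name)
        work()
    update_status(job_id, "COMPLETED", progress=100)
```

Because every step writes through `update_status`, a client polling the status API always sees the most recent step name and percentage, even mid-execution.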

What to Track

  • Job status (pending, running, completed, failed)
  • Progress percentage (0-100%)
  • Current step description ('Transcoding 720p')
  • Timestamps (created, started, updated, completed)
  • Attempt count and max attempts
  • Error details on failure (message, stack trace for internal use)

How Clients Consume Status

  • Polling: GET /api/jobs/{id} every 2-5 seconds
  • Long polling: server holds request until status changes
  • WebSocket: server pushes updates in real-time
  • Webhook: server calls client's URL on completion
  • SSE: server streams status updates over HTTP
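The polling option can be sketched as a small client loop; `fetch_status` is a stand-in for an HTTP GET to the job status endpoint and is assumed to return the JSON body as a dict.

```python
import time

def poll_until_done(fetch_status, interval=3.0, timeout=600.0):
    """Poll a job's status until it reaches a terminal state or we give up.

    `fetch_status` stands in for GET /api/jobs/{id}; it must return a dict
    with at least a "status" key.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status()
        if job["status"] in ("COMPLETED", "FAILED"):
            return job          # terminal state: stop polling
        time.sleep(interval)    # e.g. drive a UI progress bar between polls
    raise TimeoutError("job did not finish within the timeout")
```

A UI would render `job["progress"]` on each iteration; the timeout guards against jobs stuck in RUNNING.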

🎯 Interview Insight

Progress tracking transforms UX. Say: "The worker updates progress in Redis as it completes each step. The client polls GET /api/jobs/{id} every 3 seconds to show a progress bar. On completion, we send a webhook to the client's callback URL. This gives users real-time visibility without blocking."

06

End-to-End Scenario

Let's design a job scheduling system for a platform that generates daily analytics reports and processes user-uploaded videos.

Scheduling System — Full Architecturetext
SCHEDULED JOBS (cron + coordination):

  Job: Daily Analytics Report
  Cron: "0 0 * * *" (midnight)
  Flow:
    1. Cron fires on all 5 instances at midnight
    2. Each instance tries: SET "lock:analytics:2025-01-15" NX EX 300
       (SETNX alone cannot set a TTL; SET with NX and EX does both atomically)
    3. Instance 3 wins the lock
    4. Instance 3 pushes job to queue: { type: "analytics_report", date: "2025-01-15" }
    5. Worker picks up job → queries data warehouse → builds report
    6. Worker updates status: RUNNING → 30% → 60% → 90% → COMPLETED
    7. Report stored in S3, link emailed to stakeholders

ON-DEMAND JOBS (user-triggered):

  Job: Video Transcoding
  Trigger: User uploads video
  Flow:
    1. Upload completes → API creates job: POST /api/jobs
       { type: "video_transcode", input: "s3://uploads/video_789.mp4" }
    2. Job pushed to priority queue (user-facing = high priority)
    3. Worker picks up → transcodes → updates progress every 10%
    4. Client polls: GET /api/jobs/job_xyz → { status: "RUNNING", progress: 65 }
    5. UI shows: "Processing your video... 65%"
    6. Worker completes → status: COMPLETED → webhook fires
    7. Client receives webhook → shows "Video ready!"

COORDINATION:
Scheduled jobs: Redis distributed lock (prevent duplicates)
On-demand jobs: queue handles coordination (each message consumed once)
Both: idempotent workers (safe to retry on failure)

MONITORING:
Dashboard: all jobs, status, duration, failure rate
Alerts: job stuck in RUNNING > 30 min, failure rate > 5%
Dead-letter queue: failed jobs after max retries → manual review

💡 This Is How Production Systems Work

Airflow, Temporal, and Celery all implement this pattern: scheduler (timing) + queue (decoupling) + workers (execution) + status store (visibility) + coordination (single execution). The specific tools vary, but the architecture is universal.

07

Trade-offs & Decision Making

Decision: Scheduling approach
  Option A: Cron (fixed schedule)
  Option B: Dynamic (run at arbitrary time)
  Choose A when: recurring jobs (daily reports, cleanup)
  Choose B when: user-triggered (process in 30 min, retry at specific time)

Decision: Coordination
  Option A: Distributed lock (per-job)
  Option B: Leader election (single scheduler)
  Choose A when: many different jobs, independent schedules
  Choose B when: few jobs, simple setup, small cluster

Decision: Progress delivery
  Option A: Polling (client pulls)
  Option B: Push (WebSocket/webhook)
  Choose A when: simple, stateless, most use cases
  Choose B when: real-time UX needed, long-running jobs

Decision: Scheduler
  Option A: In-app (library-based)
  Option B: External (Airflow, Temporal)
  Choose A when: simple cron jobs, small team
  Choose B when: complex DAGs, dependencies, large team

🔧 Simple Stack (most teams)

  • Cron expression in app config
  • Redis distributed lock for coordination
  • SQS/RabbitMQ for job queue
  • Redis for progress, PostgreSQL for audit
  • Polling API for status

🏗️ Advanced Stack (large teams)

  • Airflow / Temporal for orchestration
  • DAG-based job dependencies
  • Built-in retry, timeout, alerting
  • UI dashboard for job management
  • Webhook + SSE for real-time status
08

Interview Questions

Q: How does cron scheduling work in a distributed system?

A: Cron defines WHEN a job should run (e.g., '0 0 * * *' = midnight). In a distributed system with N instances, the cron fires on all N simultaneously. To prevent duplicate execution, wrap it with a distributed lock: each instance tries SETNX in Redis before executing. Only one succeeds. The winner pushes the job to a queue, and a worker executes it. The lock key includes the job name and date (e.g., 'lock:daily_report:2025-01-15') to prevent re-execution. TTL on the lock handles crash recovery.

Q: How do you prevent duplicate job execution?

A: Three approaches: (1) Distributed lock (Redis SETNX) — before executing, acquire a lock. Only one instance succeeds. Lock has TTL for crash recovery. (2) Leader election — one instance is the scheduler, others are standby. Only the leader triggers jobs. (3) Queue-based dedup — push the job to a queue with a deduplication ID. The queue ensures each message is delivered once. In all cases, workers should be idempotent — if a job accidentally runs twice (lock expired, leader failover), the result should be the same.

Q: How do you track job progress in a long-running task?

A: The worker updates a status store (Redis for real-time, DB for persistence) at each step: { job_id, status: 'RUNNING', progress: 65, step: 'Transcoding 720p' }. Clients consume this via: (1) Polling — GET /api/jobs/{id} every 3 seconds. (2) WebSocket — server pushes updates. (3) Webhook — server calls client's URL on completion. The status includes: state (pending/running/completed/failed), progress %, current step, timestamps, attempt count, and error details on failure.

1

You're designing a report generation system that runs daily for 10,000 tenants

How would you schedule and coordinate this?

Answer: (1) Cron triggers at midnight: push 10,000 job messages to a queue (one per tenant). (2) Distributed lock ensures only one instance triggers the batch. (3) 20 workers consume from the queue in parallel — each generates one tenant's report. (4) Each worker updates progress in Redis. (5) Status API: GET /api/reports/{tenant_id}/latest → { status, progress, download_url }. (6) On completion, webhook notifies the tenant. (7) Failed jobs go to a dead-letter queue for manual review. This processes 10,000 reports in parallel instead of sequentially — 20 workers × 5 min/report = ~42 minutes total.
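The fan-out in this answer can be sketched with Python's standard queue and threads standing in for a real broker and worker fleet; `generate_tenant_reports` and the placeholder report string are illustrative only.

```python
import queue
import threading

def generate_tenant_reports(tenant_ids, num_workers=20):
    """Fan out one job per tenant and drain the queue with a pool of workers."""
    jobs = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    for tenant in tenant_ids:              # scheduler: one message per tenant
        jobs.put(tenant)

    def worker():
        while True:
            try:
                tenant = jobs.get_nowait()  # each message is consumed exactly once
            except queue.Empty:
                return                      # queue drained, worker exits
            report = f"report for {tenant}" # placeholder for the real generation
            with results_lock:
                results[tenant] = report
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The queue gives each worker an exclusive message, which is the same per-job coordination the answer relies on; in production the queue would be SQS or RabbitMQ and the workers separate processes.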

09

Common Pitfalls

👯

Duplicate job execution

Cron fires on all 5 instances. No distributed lock. The daily report generates 5 times. 5 emails sent to every stakeholder. The database does 5x the aggregation work. Users lose trust in the system.

Always wrap cron jobs with a distributed lock (Redis SETNX with TTL). The lock key should include the job name and execution date. Only the instance that acquires the lock triggers the job. All others skip silently.

🕐

Clock drift between servers

Server A's clock is 3 seconds ahead of Server B. Server A's cron fires first, acquires the lock, and runs the job. But sometimes Server B fires first due to NTP corrections. Jobs run at inconsistent times, and occasionally both fire within the lock's acquisition window.

Use NTP to synchronize clocks across all servers. Set lock TTL longer than the maximum expected clock drift (e.g., 60 seconds). Use the queue as the source of truth for job execution — the scheduler only triggers, the queue ensures exactly-once delivery.

🙈

No visibility into job state

A video transcoding job is submitted. The user sees 'Processing...' for 20 minutes with no progress indicator. Is it 10% done? 90% done? Did it fail? The user refreshes, submits again (duplicate), and contacts support. Support has no way to check the job status either.

Implement a status API from day one. Workers update progress at each step. The API returns: status, progress %, current step, timestamps, and error details. The UI shows a progress bar. Support can look up any job by ID. Alerts fire if a job is stuck in RUNNING for too long.

💀

Poor failure handling

A worker crashes mid-execution. The job is marked as RUNNING forever — it never completes, never fails, never retries. The lock is held indefinitely (no TTL). The next scheduled run can't acquire the lock. The job never runs again.

Use lease-based execution: locks have TTL (expire if not renewed). Workers send heartbeats to extend the lease. If a worker crashes, the lease expires and another worker can pick up the job. Set max execution time — if a job exceeds it, mark as FAILED and retry. Always have a dead-letter queue for jobs that fail after max retries.