
Delivery

Serve large files at scale — CDN delivery with presigned URLs, transcoding pipelines for media processing, and deduplication with content hashing for storage optimization.

28 min read · 9 sections
01

The Big Picture — Why Serving Large Files Is Hard

Serving a 5KB JSON API response is trivial. Serving a 500MB video to 10 million users worldwide is an entirely different problem. The bandwidth alone would crush your origin servers, the latency for distant users would be unacceptable, and storing duplicate copies of the same file wastes petabytes of storage.

🏭

The Factory vs Local Warehouses

Imagine a factory in Virginia that makes products. If every customer worldwide had to drive to Virginia to pick up their order, the factory parking lot would be gridlocked and customers in Tokyo would wait weeks. The solution: stock popular products in local warehouses (CDN edges) near major cities. Customers pick up from the nearest warehouse — fast, no factory congestion. The factory only ships to warehouses, not to individual customers. Presigned URLs are like pickup tickets — they prove you're authorized to collect your order without the factory verifying each person.

📡

Bandwidth

A 100MB video × 1M downloads = 100 TB of bandwidth. Your origin server can't handle that. CDN edges absorb 99% of this traffic.

⏱️

Latency

A user in Tokyo downloading from Virginia: 150ms round trip × hundreds of packets = seconds of delay. A CDN edge in Tokyo: 10ms. 15x faster.

💾

Storage Waste

100 users upload the same meme. Without deduplication, you store 100 copies. With content hashing, you store 1 copy and 100 references.

🔥 Key Insight

The three pillars of large file delivery: CDN (serve from the edge, not the origin), presigned URLs (bypass the backend for file transfer), and deduplication (store each unique file exactly once). Every file-heavy system — YouTube, Dropbox, Instagram — uses all three.

02

Delivery Architecture

⬆️

Upload

Client → Object Storage

⚙️

Process

Queue → Workers

💾

Store

Multiple variants

🌍

Deliver

CDN → Client

Separation of Upload, Processing, and Delivery

UPLOAD PATH (write):
  Client → Presigned Upload URL → Object Storage (S3)
  Backend never touches the file bytes
  S3 triggers event → Processing Queue

PROCESSING PATH (async):
  Queue → Worker picks up job
  Download original from S3
  Transcode: 1080p, 720p, 480p, thumbnail
  Upload variants back to S3
  Update metadata DB: "video_123 ready, 3 variants"

DELIVERY PATH (read):
  Client requests video_123
  Backend generates presigned CDN URL (720p variant)
  Client fetches directly from CDN edge
  CDN cache HIT?  → serve from edge (10ms)
  CDN cache MISS? → fetch from S3 origin, cache, serve

Key insight: the backend NEVER serves file bytes.
  Upload:   client → S3 directly (presigned upload URL)
  Download: client → CDN directly (presigned download URL)
  Backend only handles: auth, metadata, URL generation

✅ Backend Handles

  • Authentication and authorization
  • Generating presigned URLs (upload + download)
  • Metadata management (file info, variants, status)
  • Triggering processing pipelines
  • Deduplication checks (hash lookup)

❌ Backend Does NOT Handle

  • File upload bytes (client → S3 directly)
  • File download bytes (client → CDN directly)
  • Transcoding (async workers handle this)
  • Serving static assets (CDN handles this)
03

CDN + Presigned URLs

A presigned URL is a temporary, authenticated URL that grants access to a specific object in storage without requiring the client to have storage credentials. The backend generates the URL (signed with its credentials), and the client uses it to upload or download directly from storage or CDN.

Presigned URL — How It Works

DOWNLOAD FLOW:

  1. Client: GET /api/videos/123
  2. Backend:
     → Verify auth (is this user allowed to access video 123?)
     → Generate presigned URL:
       https://cdn.example.com/videos/123/720p.mp4
         ?X-Amz-Algorithm=AWS4-HMAC-SHA256
         &X-Amz-Credential=AKIA.../20250115/us-east-1/s3/aws4_request
         &X-Amz-Date=20250115T120000Z
         &X-Amz-Expires=3600          ← valid for 1 hour
         &X-Amz-Signature=abc123...   ← cryptographic signature
     → Return URL to client

  3. Client: fetches video directly from CDN using the presigned URL
     → CDN edge has it cached? → serve immediately (10ms)
     → CDN cache miss? → CDN fetches from S3, caches, serves

  Backend involvement: ~5ms (auth + URL generation)
  File transfer: 0 bytes through backend

UPLOAD FLOW:

  1. Client: POST /api/uploads (request upload URL)
  2. Backend:
     → Generate presigned upload URL for S3
     → Return URL + upload instructions
  3. Client: PUT file directly to S3 using presigned URL
     → 500MB video goes straight to S3, not through backend
  4. S3 triggers event → processing pipeline starts
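The signing step can be sketched with nothing more than an HMAC. This is a simplified stand-in for AWS Signature V4 (the real algorithm signs more fields and derives a per-request key); the secret key, paths, and parameter names below are illustrative:

```python
import hashlib
import hmac
import time

SECRET_KEY = b"server-side-secret"  # hypothetical key; real systems use cloud credentials

def presign(path, expires_in=3600, now=None):
    """Return a time-limited signed URL for `path` (simplified, not real SigV4)."""
    expires_at = (now if now is not None else int(time.time())) + expires_in
    payload = f"{path}?Expires={expires_at}".encode()
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return f"{path}?Expires={expires_at}&Signature={sig}"

def verify(url, now=None):
    """Check that the signature matches and the URL has not expired."""
    path, _, query = url.partition("?")
    params = dict(p.split("=", 1) for p in query.split("&"))
    expires_at = int(params["Expires"])
    payload = f"{path}?Expires={expires_at}".encode()
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, params["Signature"]):
        return False  # tampered path or expiry
    return (now if now is not None else int(time.time())) < expires_at

url = presign("/videos/123/720p.mp4", expires_in=3600)
assert verify(url)
assert not verify(url.replace("720p", "1080p"))  # changing the path breaks the signature
```

In production you would call the SDK's URL-signing helper rather than rolling your own; the point is that verification needs no database lookup — the signature itself proves the backend authorized this exact path and expiry.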

CDN Caching for Files

Content Type               | Cache Strategy             | TTL     | Invalidation
Images (profile pics)      | Cache aggressively         | 30 days | New URL on change (content hash in filename)
Videos (uploaded content)  | Cache aggressively         | 1 year  | Immutable — URL includes version/hash
Thumbnails                 | Cache aggressively         | 30 days | Regenerate with new URL on change
User-specific files        | Don't cache on shared CDN  | N/A     | Use presigned URLs with short expiry
Public assets (CSS/JS)     | Cache with versioning      | 1 year  | Filename includes content hash (app.a1b2c3.js)
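The content-hash-in-filename strategy from the table is easy to sketch (the helper name and the 8-character digest length are arbitrary choices):

```python
import hashlib

def hashed_filename(name, content, digest_len=8):
    """Embed a short content hash in the filename, e.g. app.js -> app.a1b2c3d4.js."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    stem, dot, ext = name.rpartition(".")
    return f"{stem}.{digest}{dot}{ext}" if dot else f"{name}.{digest}"

v1 = hashed_filename("app.js", b"console.log('v1')")
v2 = hashed_filename("app.js", b"console.log('v2')")
assert v1 != v2                                  # new content -> new URL, no invalidation needed
assert v1.startswith("app.") and v1.endswith(".js")
```

Because the URL changes whenever the bytes change, these files can be cached with a 1-year TTL and never invalidated.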

Benefits

  • Backend serves 0 file bytes — only metadata and URLs
  • CDN absorbs 99%+ of download traffic
  • Presigned URLs provide time-limited, secure access
  • Upload goes directly to S3 — no backend bottleneck
  • Global delivery via CDN edges (low latency worldwide)

Considerations

  • Presigned URLs can be shared (anyone with the URL can access)
  • Cache invalidation is complex (use content-hash URLs instead)
  • URL expiration must be tuned (too short = broken links, too long = security risk)
  • CDN costs scale with bandwidth (can be significant for video)
  • Private content needs signed cookies or short-lived URLs

🎯 Interview Insight

Presigned URLs are the standard answer for "how do you handle file uploads/downloads at scale?" Say: "The client uploads directly to S3 via a presigned URL — the backend never touches the file bytes. For downloads, the backend generates a presigned CDN URL. This means the backend handles only auth and metadata, while S3 and CDN handle all file transfer."

04

Transcoding Pipelines

Transcoding converts uploaded files into multiple formats and resolutions optimized for different devices and network conditions. A 4K video uploaded from a phone needs to be available as 1080p, 720p, 480p, and thumbnail — each encoded for efficient streaming.

Transcoding Pipeline — Architecture
Upload event triggers pipeline:

  S3 Event: "video_123.mp4 uploaded"


  Message Queue (SQS / Kafka)


  Transcoding Workers (auto-scaling)

       ├→ Worker 1: video_123 → 1080p.mp4 (H.264, 5 Mbps)
       ├→ Worker 2: video_123 → 720p.mp4  (H.264, 2.5 Mbps)
       ├→ Worker 3: video_123 → 480p.mp4  (H.264, 1 Mbps)
       ├→ Worker 4: video_123 → thumbnail.jpg (frame at 2s)
       └→ Worker 5: video_123 → HLS manifest (adaptive streaming)


  Upload variants to S3:
    s3://videos/123/1080p.mp4
    s3://videos/123/720p.mp4
    s3://videos/123/480p.mp4
    s3://videos/123/thumbnail.jpg
    s3://videos/123/manifest.m3u8


  Update metadata DB:
    video_123: status = "ready", variants = [1080p, 720p, 480p]

Key principles:
  • Always async (never block the upload response)
  • Queue decouples upload from processing
  • Workers auto-scale based on queue depth
  • Each variant is an independent job (parallelizable)
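The fan-out into independent variant jobs can be sketched with an in-process queue standing in for SQS/Kafka and a stub standing in for FFmpeg (all names and the variant list are illustrative):

```python
import queue
import threading

jobs = queue.Queue()
results = []
results_lock = threading.Lock()

def transcode(video_id, variant):
    """Stand-in for the real FFmpeg invocation."""
    return f"s3://videos/{video_id}/{variant}"

def worker():
    while True:
        job = jobs.get()
        if job is None:              # sentinel: shut this worker down
            jobs.task_done()
            return
        video_id, variant = job
        output = transcode(video_id, variant)
        with results_lock:
            results.append(output)
        jobs.task_done()

# One upload event fans out into one job per variant — each is independent,
# so different workers can transcode variants of the same video in parallel.
for variant in ["1080p.mp4", "720p.mp4", "480p.mp4", "thumbnail.jpg"]:
    jobs.put(("video_123", variant))

workers = [threading.Thread(target=worker) for _ in range(3)]
for t in workers:
    t.start()
for _ in workers:
    jobs.put(None)
for t in workers:
    t.join()

assert len(results) == 4
```

In a real deployment the queue is durable (jobs survive worker crashes) and the worker count is driven by queue depth, but the decoupling shown here is the same.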

Adaptive Streaming

Instead of downloading a single file, adaptive streaming (HLS / DASH) splits the video into small segments (2-10 seconds each) at multiple quality levels. The player dynamically switches quality based on the user's bandwidth — seamless quality adjustment without buffering.
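For reference, an HLS master playlist is just a small text file listing the available quality levels; the bitrates and paths below are illustrative values matching the pipeline above:

```text
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1000000,RESOLUTION=854x480
480p/playlist.m3u8
```

Each referenced playlist in turn lists that quality's segments; the player measures its own throughput and picks the highest bitrate it can sustain.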

📬

Queue (Decoupling)

SQS, Kafka, or RabbitMQ sits between the upload event and workers. If workers are busy, jobs queue up instead of being dropped. Workers process at their own pace.

⚙️

Workers (Processing)

Stateless containers (ECS, Kubernetes) running FFmpeg or similar. Auto-scale based on queue depth: 100 pending jobs → spin up 10 workers. 0 jobs → scale to 0.

💾

Storage (Output)

Each variant is stored as a separate object in S3. The manifest file (HLS .m3u8) lists all available qualities. CDN caches each segment independently.

Benefits

  • Device compatibility (4K TV, phone, slow connection)
  • Bandwidth optimization (serve 480p on 3G, 1080p on WiFi)
  • Async processing — upload response is instant
  • Parallelizable — each variant is an independent job
  • Cost-efficient — process once, serve millions of times

Trade-offs

  • Processing cost (transcoding is CPU-intensive)
  • Latency — video isn't available until transcoding completes
  • Storage multiplication (3-5 variants per video)
  • Complexity — pipeline monitoring, failure handling, retries
  • Not real-time — minutes to hours for long videos

🎯 Interview Insight

Transcoding is always async. Say: "After upload, an event triggers a message queue. Workers pick up jobs and transcode into multiple resolutions. The user sees a 'processing' state until all variants are ready. Workers auto-scale based on queue depth. This decouples upload latency from processing time."

05

Deduplication with Hashing

Deduplication ensures that identical files are stored only once, regardless of how many users upload them. The key insight: if two files have the same content, they produce the same hash. Store the file once, reference it by hash.

Content-Addressable Deduplication — Flow

User A uploads: cat_meme.jpg (2.3 MB)

  1. Compute hash: SHA-256(file_bytes) = "a1b2c3d4e5f6..."
  2. Check storage: does object "a1b2c3d4e5f6..." exist?
     → NO: upload file to S3 as "a1b2c3d4e5f6..."
     → Store metadata: { user: A, filename: "cat_meme.jpg", hash: "a1b2c3..." }

User B uploads: funny_cat.jpg (same image, different filename)

  1. Compute hash: SHA-256(file_bytes) = "a1b2c3d4e5f6..." (same!)
  2. Check storage: does object "a1b2c3d4e5f6..." exist?
     → YES: skip upload (file already stored)
     → Store metadata: { user: B, filename: "funny_cat.jpg", hash: "a1b2c3..." }

Result:
  Storage: 1 copy of the file (2.3 MB, not 4.6 MB)
  Metadata: 2 entries pointing to the same hash
  Savings: 50% storage reduction for this file

At scale (Dropbox, Google Drive):
  Millions of users upload the same popular files
  Deduplication saves 30-60% of total storage
  That's petabytes of savings = millions of dollars
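The flow above can be sketched with an in-memory dict standing in for S3 and a list standing in for the metadata DB (all names are illustrative):

```python
import hashlib

blobs = {}       # hash -> file bytes (stands in for S3)
metadata = []    # one entry per upload, even when the blob is shared

def upload(user, filename, data):
    """Store the bytes once per unique content; always record a metadata reference."""
    digest = hashlib.sha256(data).hexdigest()
    if digest not in blobs:          # first copy: store the bytes
        blobs[digest] = data
    metadata.append({"user": user, "filename": filename, "hash": digest})
    return digest

h1 = upload("A", "cat_meme.jpg", b"jpeg-ish bytes")
h2 = upload("B", "funny_cat.jpg", b"jpeg-ish bytes")  # same content, different name

assert h1 == h2            # identical content -> identical hash
assert len(blobs) == 1     # stored once
assert len(metadata) == 2  # two references
```

Deletion is the subtle part: a blob can only be removed when its reference count in the metadata drops to zero.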

Implementation Approaches

Approach            | How It Works                                      | Dedup Level | Best For
Whole-file hashing  | Hash entire file, compare hash                    | File-level  | Simple, effective for exact duplicates
Chunk-level hashing | Split file into chunks, hash each chunk           | Chunk-level | Partial duplicates (edited files share most chunks)
Client-side hashing | Client computes hash before upload, server checks | File-level  | Saves bandwidth (don't upload if already exists)
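Chunk-level hashing is a small extension of the same idea — sketched here with a toy 4-byte chunk size (real systems use roughly 4 MB chunks, often content-defined so that insertions don't shift every chunk boundary):

```python
import hashlib

CHUNK_SIZE = 4  # tiny for the demo; real systems use ~4 MB

def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Split into fixed-size chunks and hash each one independently."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

original = b"AAAABBBBCCCCDDDD"
edited   = b"AAAABBBBXXXXDDDD"   # only the third chunk changed

h_orig, h_edit = chunk_hashes(original), chunk_hashes(edited)
shared = sum(a == b for a, b in zip(h_orig, h_edit))
assert shared == 3   # 3 of 4 chunks deduplicate; only the edited chunk is re-stored
```

This is why syncing an edited document re-uploads kilobytes, not the whole file: unchanged chunks are already on the server.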
Client-Side Deduplication — Save Bandwidth

Without client-side dedup:
  Client uploads 500MB video → server hashes → duplicate found
  → 500MB wasted bandwidth, upload took 2 minutes for nothing

With client-side dedup:
  1. Client computes: SHA-256(file) = "a1b2c3..."
  2. Client asks server: POST /api/uploads/check { hash: "a1b2c3..." }
  3. Server checks: hash exists? → YES
  4. Server responds: { "exists": true, "file_id": "file_789" }
  5. Client skips upload entirely → just links to existing file
     → 0 bytes uploaded, instant "upload complete"

Dropbox uses this: "instant upload" for files that already exist
in any user's storage. The file never leaves the client's machine.
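The check-before-upload handshake can be sketched as follows, with a set standing in for the server's hash index and a counter standing in for bandwidth (endpoint and names are illustrative):

```python
import hashlib

# Hypothetical server state: hashes of files already in storage
server_hashes = {hashlib.sha256(b"already stored").hexdigest()}
bytes_sent = 0

def client_upload(data):
    """Hash first, ask the server, and only send bytes on a miss."""
    global bytes_sent
    digest = hashlib.sha256(data).hexdigest()
    if digest in server_hashes:        # stands in for POST /api/uploads/check
        return "instant (deduplicated)"
    bytes_sent += len(data)            # stands in for the presigned-URL PUT
    server_hashes.add(digest)
    return "uploaded"

assert client_upload(b"already stored") == "instant (deduplicated)"
assert bytes_sent == 0                 # duplicate cost: a hash check, not the file
assert client_upload(b"brand new file") == "uploaded"
```

Note the privacy trade-off mentioned below: the check endpoint necessarily reveals whether a given file exists somewhere in the system.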

Benefits

  • 30-60% storage savings at scale
  • Reduced bandwidth (client-side dedup skips upload)
  • Content-addressable: hash IS the address (immutable, cacheable)
  • Natural integrity verification (hash = checksum)
  • Simplifies CDN caching (same content = same URL forever)

Trade-offs

  • Hash computation cost (SHA-256 on large files takes time)
  • Hash collision risk (astronomically rare with SHA-256, but non-zero)
  • Deletion complexity (can't delete a file if other users reference it)
  • Reference counting needed (track how many users point to each hash)
  • Privacy concern: knowing a hash exists reveals the file exists somewhere

🎯 Interview Insight

Deduplication is the answer to "how do you optimize storage for a file sharing system?" Say: "I'd use content-addressable storage — files are stored by their SHA-256 hash. Before uploading, the client sends the hash to check if the file already exists. If yes, we skip the upload entirely and just create a metadata reference. This saves 30-60% of storage at scale."

06

End-to-End Scenario

Let's design the file delivery system for a video sharing platform using all three patterns.

Video Platform — Upload to Delivery

UPLOAD:
  1. Client computes SHA-256 of video file
  2. Client: POST /api/uploads { hash: "a1b2c3...", size: 524MB, type: "video/mp4" }
  3. Server checks dedup: hash exists?
     → YES: skip upload, return existing file_id (instant!)
     → NO: generate presigned S3 upload URL
  4. Client uploads directly to S3 via presigned URL
     → 524MB goes to S3, not through backend
  5. S3 triggers event → message sent to transcoding queue

PROCESSING (async):
  6. Transcoding worker picks up job
  7. Downloads original from S3
  8. Transcodes: 1080p, 720p, 480p, thumbnail, HLS manifest
  9. Uploads all variants to S3:
     s3://videos/{hash}/1080p.mp4
     s3://videos/{hash}/720p.mp4
     s3://videos/{hash}/480p.mp4
     s3://videos/{hash}/thumb.jpg
     s3://videos/{hash}/manifest.m3u8
  10. Updates DB: video status = "ready"

DELIVERY:
  11. Client: GET /api/videos/456
  12. Backend: auth check → generate presigned CDN URL for 720p
  13. Client fetches from CDN:
      → CDN edge in user's region has it cached? → 10ms
      → Cache miss? → CDN fetches from S3, caches, serves → 200ms
  14. Video player uses HLS manifest for adaptive streaming
      → Switches between 480p/720p/1080p based on bandwidth

DEDUPLICATION IN ACTION:
  User B uploads the same video (different title):
    → SHA-256 matches → skip upload entirely
    → Create new metadata entry pointing to same hash
    → Storage: 1 copy serves both users

💡 This Is How YouTube / Instagram Works

Upload directly to object storage (presigned URL), async transcoding pipeline (queue + workers), content-addressable deduplication (hash-based), CDN delivery (presigned URLs to edge). The backend never touches file bytes — it's purely a metadata and orchestration layer.

07

Trade-offs & Decision Making

Decision          | Option A                   | Option B                      | Choose A When                                        | Choose B When
Delivery method   | CDN + presigned URLs       | Serve through backend         | Always (at any meaningful scale)                     | Never (backend becomes bottleneck)
Processing timing | Pre-transcode all variants | On-demand transcoding         | Popular content, predictable formats                 | Long-tail content, many format combinations
Deduplication     | Content-hash dedup         | Store every upload separately | Many users upload similar content (social, sharing)  | All content is unique (user-generated documents)
Hash computation  | Client-side hashing        | Server-side hashing           | Save bandwidth (skip duplicate uploads)              | Don't trust client (security-sensitive)

💰 Cost Considerations

  • CDN bandwidth: $0.02-0.08/GB (cheaper than origin)
  • S3 storage: $0.023/GB/month
  • Transcoding: $0.015-0.030 per minute of video
  • Dedup savings: 30-60% of storage costs
  • Pre-transcoding: higher upfront cost, lower serving cost

⚡ Performance Considerations

  • CDN cache hit: ~10ms (edge), miss: ~200ms (origin)
  • Presigned URL generation: ~5ms (backend)
  • Transcoding: minutes to hours (async, not user-facing)
  • Client-side hash: seconds (runs in browser/app)
  • Dedup check: ~1ms (hash lookup in DB/Redis)
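Plugging the mid-range rates above into a back-of-envelope estimate (the traffic numbers are made up for illustration):

```python
# Illustrative monthly workload for a mid-size video platform
storage_gb    = 100_000     # 100 TB of video stored
egress_gb     = 1_000_000   # 1 PB served through the CDN per month
video_minutes = 500_000     # minutes of video transcoded this month
dedup_ratio   = 0.4         # 40% storage saved by deduplication

cdn_cost       = egress_gb * 0.04                        # mid-range $/GB
storage_cost   = storage_gb * (1 - dedup_ratio) * 0.023  # $/GB/month after dedup
transcode_cost = video_minutes * 0.02                    # mid-range $/minute

total = cdn_cost + storage_cost + transcode_cost
print(f"CDN ${cdn_cost:,.0f} + storage ${storage_cost:,.0f} "
      f"+ transcoding ${transcode_cost:,.0f} = ${total:,.0f}/month")
```

The shape of the result is typical: CDN bandwidth dominates, which is why cache hit ratio and variant bitrates are the levers worth tuning first.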
08

Interview Questions

Q: Why use presigned URLs instead of serving files through the backend?

A: If the backend serves file bytes, every download consumes backend CPU, memory, and bandwidth. A 100MB video × 10K concurrent downloads = 1 TB of bandwidth through your backend servers. With presigned URLs, the backend generates a signed URL (~5ms, ~1KB response) and the client downloads directly from S3/CDN. The backend handles 0 bytes of file transfer. This is the difference between needing 2 backend servers and needing 200. Every file-heavy system (YouTube, Dropbox, Instagram) uses this pattern.

Q: How does a CDN improve file delivery performance?

A: A CDN caches files at edge servers worldwide. A user in Tokyo gets the file from a Tokyo edge (~10ms) instead of a Virginia origin (~150ms). Benefits: (1) Lower latency — files served from the nearest edge. (2) Reduced origin load — CDN absorbs 99%+ of download traffic. (3) Higher throughput — CDN has massive bandwidth capacity. (4) Reliability — if one edge is down, traffic routes to the next nearest. For large files, the CDN also handles range requests (resume interrupted downloads) and adaptive bitrate streaming.

Q: How do you design a video processing pipeline?

A: Always async. (1) Upload triggers an event (S3 notification). (2) Event goes to a message queue (SQS/Kafka). (3) Transcoding workers pick up jobs and produce multiple variants (1080p, 720p, 480p, thumbnail, HLS manifest). (4) Variants are stored in S3. (5) Metadata DB is updated with status='ready'. Workers auto-scale based on queue depth. Each variant is an independent job — parallelizable. The user sees 'processing' until all variants are ready. Never transcode synchronously — a 1-hour video takes minutes to transcode.

1

You're designing a file sharing system like Dropbox for 100M users

How do you handle storage efficiency?

Answer: Content-addressable deduplication. (1) Client computes SHA-256 of the file before upload. (2) Client sends hash to server: 'Does this file exist?' (3) If yes → skip upload, create metadata reference to existing file. Instant 'upload'. (4) If no → client uploads via presigned URL, file stored by hash. At 100M users, many files are duplicated (same documents, memes, videos). Dedup saves 30-60% of storage — at petabyte scale, that's millions of dollars. Chunk-level dedup (splitting files into 4MB chunks and deduplicating chunks) catches partial duplicates too — an edited document shares 95% of chunks with the original.

09

Common Pitfalls

🖥️

Serving files through the backend

The backend reads the file from storage and streams it to the client. Every download consumes backend CPU, memory, network bandwidth, and a thread/connection. At 1,000 concurrent downloads of 100MB files, the backend needs 100GB of bandwidth and hundreds of threads. It becomes the bottleneck and crashes.

Never serve file bytes through the backend. Generate presigned URLs and let the client download directly from S3 or CDN. The backend handles only auth and URL generation (~5ms, ~1KB). This is non-negotiable for any file-heavy system.

🌐

Not using a CDN

All downloads come from the origin server in one region. Users in Asia get 150ms+ latency for every request. The origin server's bandwidth is saturated. Adding more origin servers doesn't help — the latency is physical (speed of light).

Put a CDN in front of your object storage. CDN edges cache popular files worldwide. A user in Tokyo gets the file from a Tokyo edge in 10ms instead of 150ms from Virginia. CDN bandwidth is also cheaper than origin bandwidth. For video, CDN is not optional — it's required.

Synchronous processing of large files

The upload API endpoint transcodes the video before returning a response. A 1-hour video takes 10 minutes to transcode. The HTTP request times out at 30 seconds. The user sees an error. Even if it didn't time out, the user would wait 10 minutes staring at a spinner.

Always process large files asynchronously. Upload → return 202 Accepted immediately → trigger processing via queue → workers transcode in the background → update status when done. The client polls for status or receives a webhook/push notification when processing completes.

🗑️

Ignoring cache invalidation

A user updates their profile picture. The old image is cached on 200 CDN edges worldwide. The new image is in S3 but CDN keeps serving the old one for hours (until TTL expires). The user sees their old photo and thinks the upload failed.

Use content-hash URLs: instead of /images/user_42/avatar.jpg, use /images/user_42/avatar_a1b2c3.jpg where a1b2c3 is the content hash. When the image changes, the URL changes — CDN fetches the new file automatically. No cache invalidation needed. This is why every modern build tool puts content hashes in filenames.