Design a Web Crawler (Googlebot)

An end-to-end interview-ready walkthrough — from capacity estimation through deep dives on URL frontier design, deduplication, politeness, adaptive scheduling, and fault tolerance. Structured to mirror the arc of a 45-minute system design interview.

Requirements

A web crawler is the backbone of every search engine: it discovers and downloads the pages that a separate indexing system later processes. The challenge isn't fetching one page; it's fetching billions while being polite, avoiding traps, detecting duplicates, and keeping content fresh. The requirements determine whether you're building a weekend scraper or a Googlebot-class system.

Functional Requirements

Core business logic & features

01. Seed URL Ingestion: Accept a set of seed URLs and recursively discover new URLs by parsing fetched pages.
02. HTML Fetching: Download web pages over HTTP/HTTPS, handling redirects, timeouts, and encoding.
03. Link Extraction & Normalization: Parse HTML to extract outgoing links, canonicalize URLs, and feed them back to the frontier.
04. Robots.txt Compliance: Fetch and respect robots.txt rules per domain: honor Disallow, Crawl-delay, and sitemap directives.
05. Content Storage: Store raw HTML and extracted metadata in durable object storage for downstream indexing.
06. Duplicate Detection: Detect and skip already-seen URLs and near-duplicate page content to avoid wasted work.
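To make the link normalization and URL-level duplicate detection requirements concrete, here is a minimal Python sketch. The names `canonicalize` and `enqueue_new` are hypothetical, the normalization rules shown (lowercase host, drop default ports and fragments) are only one reasonable subset, and the in-memory `set` stands in for whatever compact structure a real crawler would use at billions of URLs (for example a Bloom filter).

```python
from collections import deque
from typing import Iterable, Optional
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(base_url: str, href: str) -> Optional[str]:
    """Resolve href against base_url; return a canonical URL or None if not crawlable."""
    try:
        parts = urlsplit(urljoin(base_url, href.strip()))
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return None
    scheme = parts.scheme.lower()
    host = parts.hostname  # already lowercased by urlsplit
    if scheme not in DEFAULT_PORTS or not host:
        return None  # skip mailto:, javascript:, and similar links
    if ":" in host:
        host = f"[{host}]"  # restore brackets for IPv6 literals
    netloc = host if port in (None, DEFAULT_PORTS[scheme]) else f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def enqueue_new(frontier: deque, seen: set, base_url: str, hrefs: Iterable[str]) -> int:
    """Append unseen canonical URLs to the frontier; return how many were added."""
    added = 0
    for href in hrefs:
        url = canonicalize(base_url, href)
        if url is not None and url not in seen:
            seen.add(url)
            frontier.append(url)
            added += 1
    return added
```

Canonicalizing before the seen-check matters: `/a`, `/a#top`, and `HTTPS://EXAMPLE.COM/a` all collapse to one frontier entry instead of three fetches.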

Non-Functional

System constraints

Scale

1 billion pages/month — ~385 pages/sec sustained, with burst capacity to 1,000/sec.
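The sustained-rate figure is easy to verify. The short Python check below assumes a 30-day month, which the original does not state explicitly; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate(pages_per_month: int, days_per_month: int = 30) -> float:
    """Average pages per second needed to hit a monthly crawl target."""
    return pages_per_month / (days_per_month * SECONDS_PER_DAY)


rate = sustained_rate(1_000_000_000)
burst_headroom = 1_000 / rate  # burst target relative to the sustained rate

print(f"sustained: {rate:.1f} pages/sec")      # sustained: 385.8 pages/sec
print(f"burst headroom: {burst_headroom:.2f}x")  # burst headroom: 2.59x
```

The 1,000 pages/sec burst target therefore leaves roughly 2.6× headroom over the average load.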

Freshness

Full re-crawl of the known web every 2 weeks. High-priority pages re-crawled daily.
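One simple way to picture these two freshness tiers is a due-time priority queue. The sketch below is an assumption-laden illustration, not the scheduler described later in this walkthrough: the tier names `high` and `normal`, the class `RecrawlQueue`, and the fixed intervals are all hypothetical.

```python
import heapq
from typing import List, Tuple

DAY_S = 24 * 60 * 60
# Assumed mapping of the two tiers from the requirement to re-crawl intervals.
RECRAWL_INTERVAL_S = {"high": 1 * DAY_S, "normal": 14 * DAY_S}


class RecrawlQueue:
    """Min-heap of (due_time, url) so the most overdue page is always on top."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, str]] = []

    def schedule(self, url: str, priority: str, fetched_at: float) -> None:
        due = fetched_at + RECRAWL_INTERVAL_S[priority]
        heapq.heappush(self._heap, (due, url))

    def pop_due(self, now: float) -> List[str]:
        """Remove and return every URL whose re-crawl time has arrived."""
        due_urls = []
        while self._heap and self._heap[0][0] <= now:
            due_urls.append(heapq.heappop(self._heap)[1])
        return due_urls
```

After each fetch the crawler would call `schedule` again with the new fetch time, so every page cycles through the queue at its tier's interval.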

Politeness

Never overwhelm a single domain. Per-host rate limits, typically 1 req/sec or per robots.txt Crawl-delay.
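As a rough sketch of how per-host rate limiting and robots.txt rules combine, the Python below uses the standard library's `urllib.robotparser`. The class name `PolitenessGate`, the bot name, and the injected clock are assumptions made for illustration; the 1-second default mirrors the figure above. Note that Crawl-delay is a non-standard directive that not every crawler honors.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY_S = 1.0  # fallback when robots.txt gives no Crawl-delay


class PolitenessGate:
    """Decides whether a URL may be fetched right now, one decision per host."""

    def __init__(self, user_agent: str, clock: Callable[[], float] = time.monotonic):
        self.user_agent = user_agent
        self.clock = clock
        self.robots: Dict[str, RobotFileParser] = {}
        self.next_allowed: Dict[str, float] = {}

    def set_robots(self, host: str, robots_txt: str) -> None:
        """Register an already-downloaded robots.txt body for a host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def delay_for(self, host: str) -> float:
        parser = self.robots.get(host)
        delay = parser.crawl_delay(self.user_agent) if parser else None
        return float(delay) if delay is not None else DEFAULT_DELAY_S

    def try_acquire(self, url: str) -> bool:
        """Return True and reserve the host's next slot if the fetch is allowed now."""
        host = urlsplit(url).hostname or ""
        parser = self.robots.get(host)
        if parser is not None and not parser.can_fetch(self.user_agent, url):
            return False  # disallowed by robots.txt
        now = self.clock()
        if now < self.next_allowed.get(host, float("-inf")):
            return False  # too soon since the last request to this host
        self.next_allowed[host] = now + self.delay_for(host)
        return True
```

A fetcher would call `try_acquire(url)` before each request and push the URL back onto the frontier when it returns False. In a distributed deployment the per-host state would need to live with whichever worker owns that host, which this single-process sketch ignores.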

Fault Tolerance

No single failure loses crawl progress. Resume from last checkpoint on crash.
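The checkpoint-and-resume idea can be illustrated with an atomic snapshot of crawl state. This is a single-machine sketch under stated assumptions: the function names and the JSON layout are hypothetical, and a production crawler would more likely persist the frontier in a replicated store than in a local file.

```python
import json
import os
import tempfile
from collections import deque
from typing import Deque, Set, Tuple


def save_checkpoint(path: str, frontier: Deque[str], seen: Set[str]) -> None:
    """Write crawl state to a temp file, then atomically swap it into place."""
    state = {"frontier": list(frontier), "seen": sorted(seen)}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
    os.replace(tmp_path, path)  # atomic: readers see the old or the new file, never half


def load_checkpoint(path: str) -> Tuple[Deque[str], Set[str]]:
    """Return (frontier, seen) from the last checkpoint, or empty state if none exists."""
    if not os.path.exists(path):
        return deque(), set()
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return deque(state["frontier"]), set(state["seen"])
```

Because the rename is atomic, a crash mid-write leaves the previous checkpoint intact; at worst the crawler repeats the work done since the last successful save.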

🎯 Clarifying questions worth asking

Each one changes the architecture significantly:

Do we need to render JavaScript? (headless browser pool vs static HTML fetch — 10× cost difference)
Are we crawling the entire web or a specific domain? (single-domain crawling is far simpler than internet-scale crawling)
What's the freshness SLA? (news sites need hourly; corporate sites need weekly)
Do we store just URLs or full page content? (storage grows from GBs to PBs)
Multi-region or single datacenter? (geo-local crawling reduces latency to target servers)
Budget for DNS lookups? (at 385 URLs/sec, DNS becomes a bottleneck without caching)

In scope vs out of scope

In Scope	Out of Scope	Why
URL discovery + HTML fetch	Full-text indexing / ranking	Indexing is a separate system (inverted index, PageRank) — different interview problem
Robots.txt compliance	Legal compliance (GDPR, CCPA)	Legal is a policy layer, not an architecture decision
URL + content deduplication	Semantic duplicate detection (same article, different site)	Requires NLP — separate ML pipeline
Politeness / rate limiting	Anti-bot evasion (CAPTCHA solving, IP rotation)	Ethical crawlers don't evade — they respect signals
Adaptive re-crawl scheduling	Real-time change detection (WebSub, RSS polling)	Push-based freshness is a different architecture

💡 Interviewer signal

The strongest opening move is: "A crawler has two hard problems — politeness at scale and freshness under resource constraints. Everything else is plumbing." This frames the entire discussion around the two dimensions interviewers care about most.
