Design a Web Crawler (Googlebot)
An end-to-end interview-ready walkthrough — from capacity estimation through deep dives on URL frontier design, deduplication, politeness, adaptive scheduling, and fault tolerance. Structured to mirror the arc of a 45-minute system design interview.
Requirements
A web crawler is the backbone of every search engine: it discovers and downloads the pages that a separate indexing system later processes. The challenge isn't fetching one page; it's fetching billions while being polite, avoiding traps, detecting duplicates, and staying fresh. The requirements determine whether you're building a weekend scraper or a Googlebot-class system.
Functional Requirements
Core business logic & features
- 01. Seed URL Ingestion: Accept a set of seed URLs and recursively discover new URLs by parsing fetched pages.
- 02. HTML Fetching: Download web pages over HTTP/HTTPS, handling redirects, timeouts, and encoding.
- 03. Link Extraction & Normalization: Parse HTML to extract outgoing links, canonicalize URLs, and feed them back to the frontier.
- 04. Robots.txt Compliance: Fetch and respect robots.txt rules per domain, honoring Disallow, Crawl-delay, and Sitemap directives.
- 05. Content Storage: Store raw HTML and extracted metadata in durable object storage for downstream indexing.
- 06. Duplicate Detection: Detect and skip already-seen URLs and near-duplicate page content to avoid wasted work.
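Requirements 03 and 06 can be made concrete with a small sketch. The Python below is one possible approach under stated assumptions, not the design this walkthrough prescribes: `canonicalize` and `SeenFilter` are hypothetical names, the normalization rules (lowercase scheme and host, drop default ports and fragments, default the path to `/`) are a common minimal subset, and exact SHA-256 hashing stands in for near-duplicate detection, which would need a similarity fingerprint such as SimHash.

```python
import hashlib
from typing import Optional
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(base_url: str, href: str) -> Optional[str]:
    """Resolve href against base_url and return a normalized absolute URL.

    Returns None for links a page fetcher cannot follow (mailto:, javascript:,
    malformed ports, and so on).
    """
    try:
        parts = urlsplit(urljoin(base_url, href.strip()))
        port = parts.port  # raises ValueError on a non-numeric port
    except ValueError:
        return None
    scheme = parts.scheme.lower()
    host = parts.hostname  # already lowercased by urlsplit
    if scheme not in DEFAULT_PORTS or not host:
        return None
    if ":" in host:  # IPv6 literal: restore the brackets urlsplit removed
        host = f"[{host}]"
    # Keep the port only when it differs from the scheme default.
    netloc = host if port in (None, DEFAULT_PORTS[scheme]) else f"{host}:{port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class SeenFilter:
    """Exact-match dedup for URLs and page bodies (in-memory, single process)."""

    def __init__(self):
        self._urls = set()
        self._bodies = set()

    @staticmethod
    def _digest(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def is_new_url(self, url: str) -> bool:
        """Return True the first time a canonical URL is offered."""
        key = self._digest(url.encode("utf-8"))
        if key in self._urls:
            return False
        self._urls.add(key)
        return True

    def is_new_content(self, body: bytes) -> bool:
        """Return True the first time a byte-identical body is offered."""
        key = self._digest(body)
        if key in self._bodies:
            return False
        self._bodies.add(key)
        return True
```

For example, `canonicalize("https://Example.com/a/b.html", "../c?x=1#frag")` yields `https://example.com/c?x=1`. At billions of URLs an in-memory set would likely give way to a Bloom filter or a sharded key-value store; the interface can stay the same.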
Non-Functional Requirements
System constraints
- Scale: 1 billion pages/month, or ~385 pages/sec sustained, with burst capacity to 1,000/sec.
- Freshness: Full re-crawl of the known web every 2 weeks. High-priority pages re-crawled daily.
- Politeness: Never overwhelm a single domain. Per-host rate limits, typically 1 req/sec or per robots.txt Crawl-delay.
- Fault Tolerance: No single failure loses crawl progress. Resume from last checkpoint on crash.
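The politeness constraint can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration under assumptions, not a production rate limiter: `HostPoliteness`, `USER_AGENT`, and `DEFAULT_DELAY_S` are hypothetical names, the robots.txt body is passed in as a string rather than fetched, and the clock is supplied by the caller so the behavior is deterministic. Note that Crawl-delay is a non-standard extension that not every crawler honors, and the standard-library parser only accepts integer values for it.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot"  # hypothetical crawler name
DEFAULT_DELAY_S = 1.0      # the 1 req/sec default from the requirement


class HostPoliteness:
    """Holds robots.txt rules and the next allowed fetch time for ONE host."""

    def __init__(self, robots_txt: str):
        self.rules = RobotFileParser()
        self.rules.parse(robots_txt.splitlines())
        delay = self.rules.crawl_delay(USER_AGENT)
        # Fall back to the default when robots.txt sets no Crawl-delay.
        self.delay_s = float(delay) if delay is not None else DEFAULT_DELAY_S
        self.next_allowed = 0.0

    def try_acquire(self, url: str, now: float) -> bool:
        """Return True and reserve the slot if url may be fetched at time now."""
        if not self.rules.can_fetch(USER_AGENT, url):
            return False  # disallowed by robots.txt
        if now < self.next_allowed:
            return False  # too soon after the previous request to this host
        self.next_allowed = now + self.delay_s
        return True


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\nCrawl-delay: 5\n"
    host = HostPoliteness(robots)
    print(host.try_acquire("https://example.com/public", now=100.0))
    print(host.try_acquire("https://example.com/public", now=102.0))
    print(host.try_acquire("https://example.com/private/x", now=200.0))
```

A crawler would keep one such object per hostname, and a rejected URL would go back to that host's queue rather than being dropped.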
🎯 Clarifying questions worth asking
Each one changes the architecture significantly:
- Do we need to render JavaScript? (headless browser pool vs static HTML fetch — 10× cost difference)
- Are we crawling the entire web or a specific domain? (single-domain is a far simpler problem than internet-scale crawling)
- What's the freshness SLA? (news sites need hourly; corporate sites need weekly)
- Do we store just URLs or full page content? (storage grows from GBs to PBs)
- Multi-region or single datacenter? (geo-local crawling reduces latency to target servers)
- Budget for DNS lookups? (at 385 URLs/sec, DNS becomes a bottleneck without caching)
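The storage and DNS questions above are easier to weigh with rough numbers. The following back-of-envelope sketch starts from the 1 billion pages/month requirement; the average page size, average URL length, and DNS cache hit rate are assumptions chosen purely for illustration, and the DNS model simplistically charges one lookup per fetched URL on a cache miss.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # 2,592,000 seconds in a 30-day month
PAGES_PER_MONTH = 1_000_000_000      # from the scale requirement

# ASSUMPTIONS for illustration only; measure these on a real crawl.
AVG_PAGE_BYTES = 100_000             # ~100 KB of raw HTML per page
AVG_URL_BYTES = 100                  # ~100 bytes per stored URL
DNS_CACHE_HIT_RATE = 0.95            # fraction of lookups served from cache


def pages_per_second() -> float:
    """Sustained fetch rate implied by the monthly page budget."""
    return PAGES_PER_MONTH / SECONDS_PER_MONTH


def monthly_storage_bytes(bytes_per_page: int) -> int:
    """Bytes written per month if each page costs bytes_per_page."""
    return PAGES_PER_MONTH * bytes_per_page


def uncached_dns_lookups_per_second(hit_rate: float) -> float:
    """Lookups per second that miss the cache and reach a resolver."""
    return pages_per_second() * (1.0 - hit_rate)


if __name__ == "__main__":
    print(f"fetch rate:       {pages_per_second():.1f} pages/sec")
    print(f"URLs only:        {monthly_storage_bytes(AVG_URL_BYTES) / 1e9:.0f} GB/month")
    print(f"full content:     {monthly_storage_bytes(AVG_PAGE_BYTES) / 1e12:.0f} TB/month")
    print(f"DNS cache misses: {uncached_dns_lookups_per_second(DNS_CACHE_HIT_RATE):.1f} lookups/sec")
```

Under these assumptions, storing only URLs costs about 100 GB a month, while storing full content costs about 100 TB a month and passes a petabyte within a year. That is the GBs-to-PBs jump the storage question refers to.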
In scope vs out of scope
| In Scope | Out of Scope | Why |
|---|---|---|
| URL discovery + HTML fetch | Full-text indexing / ranking | Indexing is a separate system (inverted index, PageRank) — different interview problem |
| Robots.txt compliance | Legal compliance (GDPR, CCPA) | Legal is a policy layer, not an architecture decision |
| URL + content deduplication | Semantic duplicate detection (same article, different site) | Requires NLP — separate ML pipeline |
| Politeness / rate limiting | Anti-bot evasion (CAPTCHA solving, IP rotation) | Ethical crawlers don't evade — they respect signals |
| Adaptive re-crawl scheduling | Real-time change detection (WebSub, RSS polling) | Push-based freshness is a different architecture |
💡 Interviewer signal
The strongest opening move is: "A crawler has two hard problems — politeness at scale and freshness under resource constraints. Everything else is plumbing." This frames the entire discussion around the two dimensions interviewers care about most.