Web Crawler · Googlebot · URL Frontier · Bloom Filter · SimHash · Politeness · Distributed Systems · Interview

Design a Web Crawler (Googlebot)

An end-to-end interview-ready walkthrough — from capacity estimation through deep dives on URL frontier design, deduplication, politeness, adaptive scheduling, and fault tolerance. Structured to mirror the arc of a 45-minute system design interview.

45 min read · 15 sections
01

Requirements

A web crawler is the backbone of every search engine: it discovers and downloads the pages that the index is later built from. The challenge isn't fetching one page; it's fetching billions while being polite, avoiding traps, detecting duplicates, and staying fresh. The requirements define whether you're building a weekend scraper or a Googlebot-class system.

Functional Requirements

Core business logic & features

  • 01. Seed URL Ingestion: Accept a set of seed URLs and recursively discover new URLs by parsing fetched pages.
  • 02. HTML Fetching: Download web pages over HTTP/HTTPS, handling redirects, timeouts, and encoding.
  • 03. Link Extraction & Normalization: Parse HTML to extract outgoing links, canonicalize URLs, and feed them back to the frontier.
  • 04. Robots.txt Compliance: Fetch and respect robots.txt rules per domain, honoring Disallow, Crawl-delay, and Sitemap directives.
  • 05. Content Storage: Store raw HTML and extracted metadata in durable object storage for downstream indexing.
  • 06. Duplicate Detection: Detect and skip already-seen URLs and near-duplicate page content to avoid wasted work.
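
Requirements 03 and 06 interact: exact-URL deduplication only works if equivalent URLs collapse to one canonical string first. The sketch below is one possible way to do that with Python's standard library. The function names `normalize_url` and `discover` are hypothetical, and the specific rules (dropping fragments, stripping default ports, sorting query parameters) are assumptions rather than a fixed standard; sorting parameters in particular can be wrong for sites that are sensitive to parameter order. An in-memory `set` stands in for whatever seen-URL store a production crawler would use.

```python
from typing import Iterable, List, Optional, Set
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(base_url: str, href: str) -> Optional[str]:
    """Resolve href against base_url and return a canonical URL, or None if not crawlable."""
    absolute = urljoin(base_url, href.strip())
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS or not parts.hostname:
        return None  # skips mailto:, javascript:, and similar non-HTTP links
    try:
        port = parts.port
    except ValueError:
        return None  # malformed port
    host = parts.hostname  # urlsplit already lowercases the hostname
    if ":" in host:
        host = f"[{host}]"  # restore brackets for IPv6 literals
    netloc = host if port in (None, DEFAULT_PORTS[scheme]) else f"{host}:{port}"
    path = parts.path or "/"
    # Assumption: parameter order is not significant, so sort for a stable form.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment is dropped


def discover(base_url: str, hrefs: Iterable[str], seen: Set[str]) -> List[str]:
    """Return newly discovered canonical URLs and record them in `seen`."""
    fresh = []
    for href in hrefs:
        url = normalize_url(base_url, href)
        if url is not None and url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

With this sketch, `/a`, `/a#top`, and `HTTP://EXAMPLE.COM/a` found on `http://example.com/` all collapse to the single frontier entry `http://example.com/a`.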

Non-Functional

System constraints

Scale

1 billion pages/month — ~385 pages/sec sustained, with burst capacity to 1,000/sec.
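
The sustained rate follows directly from the monthly target. The short calculation below reproduces it, assuming a 30-day month (the helper name `sustained_rate` is only for illustration).

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a 30-day month


def sustained_rate(pages_per_month: float, days: int = DAYS_PER_MONTH) -> float:
    """Average pages per second needed to hit a monthly page target."""
    return pages_per_month / (days * SECONDS_PER_DAY)


rate = sustained_rate(1_000_000_000)
burst_headroom = 1_000 / rate  # how far the 1,000/sec burst target sits above average

print(f"{rate:.1f} pages/sec sustained, burst headroom {burst_headroom:.1f}x")
```

Under that assumption the average works out to roughly 385.8 pages/sec, so the 1,000/sec burst target is about 2.6 times the sustained load.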

Freshness

Full re-crawl of the known web every 2 weeks. High-priority pages re-crawled daily.

Politeness

Never overwhelm a single domain. Per-host rate limits, typically 1 req/sec or per robots.txt Crawl-delay.
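
A minimal sketch of that per-host rule, using the standard library's `urllib.robotparser`. The class name `HostPoliteness`, the `ExampleBot` user agent, and the injectable `clock` are hypothetical choices for illustration. The robots.txt body is passed in as already-fetched lines so the example makes no network calls. Note that Crawl-delay is a widely used but non-standard directive, and `robotparser` only accepts integer values for it.

```python
import time
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 1.0  # seconds between requests when robots.txt sets no Crawl-delay


class HostPoliteness:
    """Tracks robots.txt rules and the earliest next-fetch time for one host."""

    def __init__(self, robots_lines, user_agent="ExampleBot", clock=time.monotonic):
        self.user_agent = user_agent
        self.clock = clock
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)  # lines of an already-downloaded robots.txt
        delay = self.robots.crawl_delay(user_agent)
        self.delay = float(delay) if delay is not None else DEFAULT_DELAY
        self.next_allowed = 0.0

    def allowed(self, url: str) -> bool:
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.robots.can_fetch(self.user_agent, url)

    def wait_time(self) -> float:
        """Seconds the caller should still wait before hitting this host again."""
        return max(0.0, self.next_allowed - self.clock())

    def record_fetch(self) -> None:
        """Call right after a request is sent to push out the next allowed time."""
        self.next_allowed = self.clock() + self.delay
```

A fetcher would keep one such object per host, skip URLs where `allowed()` is false, and requeue any URL whose host still has a non-zero `wait_time()` instead of sleeping on it.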

Fault Tolerance

No single failure loses crawl progress. Resume from last checkpoint on crash.
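
As a rough single-node illustration of "resume from last checkpoint", the sketch below writes crawl state to a temporary file and atomically renames it into place, so a crash mid-write never leaves a half-written checkpoint. The function names and the `pending`/`done` state layout are assumptions; a distributed crawler would more likely persist this state in a replicated queue or database.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file in the same directory, then atomically swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes reach disk before the rename
        os.replace(tmp_path, path)  # atomic replacement of the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or an empty state on first run."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {"pending": [], "done": []}
```

On restart, a worker would call `load_checkpoint` and re-enqueue everything in `pending`; URLs fetched after the last checkpoint may be fetched twice, which is acceptable as long as downstream storage is idempotent.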

🎯 Clarifying questions worth asking

Each one changes the architecture significantly:

  • Do we need to render JavaScript? (headless browser pool vs static HTML fetch — 10× cost difference)
  • Are we crawling the entire web or a specific domain? (a single-domain crawl is far simpler than an internet-scale one)
  • What's the freshness SLA? (news sites need hourly; corporate sites need weekly)
  • Do we store just URLs or full page content? (storage grows from GBs to PBs)
  • Multi-region or single datacenter? (geo-local crawling reduces latency to target servers)
  • Budget for DNS lookups? (at 385 URLs/sec, DNS becomes a bottleneck without caching)
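
To make the DNS point concrete, here is a small sketch of an in-process cache in front of the resolver. The `DnsCache` class is hypothetical, and the fixed 300-second TTL is an assumption; a real crawler would more likely honor each record's own TTL and run a shared caching resolver.

```python
import socket
import time


class DnsCache:
    """Caches hostname lookups for a fixed TTL to avoid one DNS query per URL."""

    def __init__(self, ttl_seconds=300.0, resolver=socket.gethostbyname, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.resolver = resolver  # injectable so tests need no network access
        self.clock = clock
        self._entries = {}  # host -> (ip, expiry time)

    def resolve(self, host: str) -> str:
        now = self.clock()
        hit = self._entries.get(host)
        if hit is not None and hit[1] > now:
            return hit[0]  # cache hit: no DNS round trip
        ip = self.resolver(host)
        self._entries[host] = (ip, now + self.ttl)
        return ip
```

Because crawl traffic is heavily skewed toward hosts already being crawled, even a simple cache like this removes most lookups from the fetch path.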

In scope vs out of scope

In Scope | Out of Scope | Why
URL discovery + HTML fetch | Full-text indexing / ranking | Indexing is a separate system (inverted index, PageRank) and a different interview problem
Robots.txt compliance | Legal compliance (GDPR, CCPA) | Legal is a policy layer, not an architecture decision
URL + content deduplication | Semantic duplicate detection (same article, different site) | Requires NLP, which is a separate ML pipeline
Politeness / rate limiting | Anti-bot evasion (CAPTCHA solving, IP rotation) | Ethical crawlers don't evade; they respect signals
Adaptive re-crawl scheduling | Real-time change detection (WebSub, RSS polling) | Push-based freshness is a different architecture

💡 Interviewer signal

The strongest opening move is: "A crawler has two hard problems — politeness at scale and freshness under resource constraints. Everything else is plumbing." This frames the entire discussion around the two dimensions interviewers care about most.
