
Probabilistic Data Structures

Master Bloom filters, HyperLogLog, and Count-Min Sketch — memory-efficient structures that trade accuracy for performance at massive scale.

01

The Big Picture — Why Approximate?

At massive scale, exact answers become impossibly expensive. Counting the exact number of unique visitors to a website with 1 billion daily hits requires storing every single visitor ID — gigabytes of memory. But if you only need to know "roughly 50 million unique visitors, ±2%" — you can do it in 12 KB. That's the power of probabilistic data structures.

👥

The Crowd Estimation Analogy

You're at a concert and someone asks 'How many people are here?' You could count every single person — accurate but takes hours. Or you could estimate: count people in one section (200), count the sections (50), multiply: ~10,000 people. You're off by maybe 5%, but you got the answer in 10 seconds instead of 3 hours. Probabilistic data structures work the same way: they give you fast, memory-efficient answers that are 'close enough' for practical purposes.

🔥 Key Insight

Probabilistic data structures don't give wrong answers — they give approximate answers with known error bounds. A Bloom filter might say "maybe yes" when the answer is "no" (false positive), but it will never say "no" when the answer is "yes" (no false negatives). The error is controlled, predictable, and acceptable for the use case.

02

Overview & Trade-offs

🌫️

Bloom Filter

Answers: 'Is this element in the set?' Guarantees: no false negatives, possible false positives. Use: membership testing before expensive lookups.

🔢

HyperLogLog

Answers: 'How many unique elements are there?' Guarantees: ±2% accuracy in ~12 KB. Use: counting unique users, IPs, queries.

📊

Count-Min Sketch

Answers: 'How often does this element appear?' Guarantees: never underestimates, may overestimate. Use: frequency tracking, trending detection.

Structure        | Question It Answers       | Memory                | Error Type                          | Use Case
Bloom Filter     | Is X in the set?          | ~1 byte per element   | False positives (says yes when no)  | Cache penetration, dedup, spell check
HyperLogLog      | How many unique elements? | ~12 KB (fixed!)       | ±2% count error                     | Unique visitors, distinct counts
Count-Min Sketch | How often does X appear?  | Configurable (KB-MB)  | Overestimates frequency             | Trending topics, rate limiting, heavy hitters
Exact Set        | Is X in the set?          | ~50 bytes per element | None                                | When precision is required
Exact Count      | How many unique?          | ~50 bytes per element | None                                | When exact count is required

💡 When Approximation Is Good Enough

"We had 49,823,417 unique visitors yesterday" vs "We had ~50 million unique visitors yesterday." For a dashboard, both are equally useful. But the exact answer costs gigabytes of memory. The approximate answer costs 12 KB. If the business decision is the same either way, use the approximation.

03

Bloom Filters

A Bloom filter is a space-efficient structure that answers one question: "Is this element in the set?" It can say "definitely not" or "probably yes." It never gives false negatives — if it says no, the element is guaranteed to not be in the set.

📚

The Library Shortcut

Before walking to the library to check if they have a specific book, you call the front desk. The receptionist checks a quick reference list (not the full catalog). If they say 'We definitely don't have it' — trust them, don't bother going. If they say 'We might have it' — go check the actual shelves. The quick reference list occasionally says 'might have it' for books they don't (false positive), but it NEVER says 'don't have it' for books they do (no false negatives). That's a Bloom filter.

How It Works

1

Initialize a bit array

Start with an array of m bits, all set to 0. Choose k independent hash functions. Typical: m = 10 bits per element, k = 7 hash functions → ~1% false positive rate.

2

Add an element

Hash the element with all k hash functions. Each hash produces a position in the bit array. Set those k positions to 1. Example: add('alice') → hash1=3, hash2=7, hash3=12 → set bits 3, 7, 12 to 1.

3

Check membership

Hash the query with the same k functions. Check if ALL k positions are 1. If any position is 0 → element is DEFINITELY NOT in the set. If all positions are 1 → element is PROBABLY in the set (could be a false positive from other elements setting those bits).

Bloom Filter — Step by Step
Bit array (m=16): [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
Hash functions: k=3

Add "alice":
  hash1("alice") = 3, hash2("alice") = 7, hash3("alice") = 12
  Bit array:       [0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0]

Add "bob":
  hash1("bob") = 1, hash2("bob") = 7, hash3("bob") = 14
  Bit array:       [0,1,0,1,0,0,0,1,0,0,0,0,1,0,1,0]

Check "alice":
  Positions 3, 7, 12 → all are 1 → "PROBABLY YES"

Check "carol":
  hash1("carol") = 1, hash2("carol") = 3, hash3("carol") = 9
  Positions 1, 3 are 1, but position 9 is 0 → "DEFINITELY NO"

Check "dave":
  hash1("dave") = 1, hash2("dave") = 7, hash3("dave") = 14
  Positions 1, 7, 14 → all are 1 → "PROBABLY YES"
  But dave was never added! → FALSE POSITIVE ⚠️
  (Bits were set by alice and bob coincidentally)
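The trace above maps directly to a few lines of Python. This is a minimal illustrative sketch, not a production implementation — it fakes k independent hash functions by salting a single SHA-256 hash with the function index:

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initially 0

    def _positions(self, item):
        # Derive k positions by salting one hash function with its index
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All k bits set -> "probably yes"; any bit 0 -> "definitely no"
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(item))

bf = BloomFilter(m=160, k=3)
bf.add("alice")
bf.add("bob")
print("alice" in bf)  # True — added elements are always found
print("carol" in bf)  # "definitely no" unless a rare false positive
```

Note that membership on an added element is guaranteed `True` (no false negatives), while a negative query is correct except for the small, tunable false positive probability.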

Strengths

  • Extremely memory efficient (~1 byte per element for 1% FP rate)
  • O(k) lookup time (k hash functions, typically 3-7)
  • No false negatives — 'definitely not' is always correct
  • Perfect for: 'should I bother checking the database?'
  • Used in: Cassandra, Chrome (malicious URL check), CDNs

Limitations

  • False positives possible (says 'maybe yes' when answer is 'no')
  • Cannot delete elements (basic version — counting Bloom filters can)
  • Cannot enumerate elements (you can't list what's in the filter)
  • Size must be chosen upfront (can't grow dynamically)
  • False positive rate increases as the filter fills up
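The rule of thumb from step 1 (10 bits per element and 7 hash functions for ~1% false positives) comes from the standard sizing formulas, m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2. A quick sketch:

```python
import math

def bloom_params(n, p):
    """Standard Bloom filter sizing:
    m = -n*ln(p) / (ln 2)^2 bits, k = (m/n) * ln 2 hash functions."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k

# 1 million elements at a 1% false positive target
m, k = bloom_params(1_000_000, 0.01)
print(m, k)  # ~9.6 million bits (~1.2 MB), k = 7
```

This also shows why the size must be chosen upfront: m depends on the expected number of elements n, and exceeding n pushes the false positive rate above the target.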

🎯 Real-World Use: Cache Penetration Prevention

Problem: attackers query for keys that don't exist in the cache OR the database. Every request bypasses the cache and hits the DB. Solution: put all valid keys in a Bloom filter. Before querying the DB, check the filter. If it says "definitely not" → return 404 immediately, no DB query. Saves millions of unnecessary database hits.
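The lookup flow described above might look like the sketch below. The names are hypothetical, and a plain set stands in for the Bloom filter so the example is self-contained:

```python
# Illustrative cache-penetration guard. `bloom` stands in for a Bloom
# filter of all valid keys (a plain set here, so the sketch runs as-is);
# `cache` and `db` are hypothetical dict-like stores.
def lookup(key, bloom, cache, db):
    if key not in bloom:
        return None                # "definitely not" -> no cache or DB hit
    if key in cache:
        return cache[key]          # cache hit
    value = db.get(key)            # filter said "probably yes" -> verify in DB
    if value is not None:
        cache[key] = value         # populate the cache for next time
    return value

bloom, cache, db = {"a", "b"}, {}, {"a": 1, "b": 2}
print(lookup("nonsense-key", bloom, cache, db))  # None, DB never touched
print(lookup("a", bloom, cache, db))             # 1, fetched and cached
```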

04

HyperLogLog

HyperLogLog (HLL) answers: "How many unique elements are in this set?" — also called cardinality estimation. It can count billions of unique elements using only ~12 KB of memory, with a typical error of ±2%.

🎲

The Coin Flip Intuition

Imagine someone flips a coin over and over and tells you only their longest streak of heads. If the longest streak is 3, they probably flipped around 8 times (2³). If the longest streak is 7, they probably flipped around 128 times (2⁷). The longest streak gives you an estimate of how many flips happened. HyperLogLog works similarly: it hashes each element and tracks the longest run of leading zeros in the binary hash. More leading zeros = more unique elements were hashed. It uses multiple 'buckets' (like running the experiment many times in parallel) and averages the results for accuracy.

HyperLogLog — The Core Idea
For each element:
  1. Hash it to a binary string
  2. Count the leading zeros

hash("alice") = 00010110... → 3 leading zeros
hash("bob")   = 00000011... → 6 leading zeros
hash("carol") = 01001010... → 1 leading zero

Maximum leading zeros seen: 6
Estimated unique count ≈ 2⁶ = 64

With 16,384 buckets (standard HLL):
  Each bucket tracks its own max leading zeros
  Final estimate = harmonic mean across all buckets
  Accuracy: ±0.81% with 12 KB of memory

Comparison:
  Exact unique count of 1 billion elements: ~8 GB (storing all IDs)
  HyperLogLog estimate of 1 billion elements: 12 KB
  Memory savings: 700,000x
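The bucketed leading-zeros idea can be sketched in a few lines of Python. This is a toy LogLog-style estimator, not full HyperLogLog — it uses the simpler arithmetic mean of the registers with the classic LogLog constant (~0.397) instead of HyperLogLog's harmonic mean and bias corrections, so it inherits the weakness on small cardinalities noted below:

```python
import hashlib

def rough_cardinality(items, b=10):
    """Toy LogLog-style cardinality estimate with 2^b buckets.
    Each bucket tracks the max 'rank' (leading zeros + 1) it has seen."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = int.from_bytes(
            hashlib.sha256(str(item).encode()).digest()[:8], "big")
        bucket = h >> (64 - b)                    # first b bits pick the bucket
        rest = h & ((1 << (64 - b)) - 1)          # remaining 54 bits
        rank = (64 - b) - rest.bit_length() + 1   # leading zeros + 1
        registers[bucket] = max(registers[bucket], rank)
    # LogLog estimate: alpha * m * 2^(mean register value)
    return 0.39701 * m * 2 ** (sum(registers) / m)

print(round(rough_cardinality(range(50_000))))  # roughly 50,000, within a few %
```

The key property survives even in this toy: memory is fixed at m small registers no matter how many items stream through, and two such register arrays can be merged by taking the element-wise max.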

Strengths

  • Fixed memory: ~12 KB regardless of cardinality
  • Counts billions of unique elements with ±2% error
  • Mergeable: combine HLLs from different servers
  • Built into Redis (PFADD, PFCOUNT, PFMERGE)
  • Used by: Google Analytics, Redis, Presto, Spark

Limitations

  • Approximate — not suitable when exact count is required
  • Cannot list the unique elements (only count them)
  • Cannot remove elements
  • Less accurate for very small cardinalities (<100)
  • Error is relative (±2% of 1 billion = ±20 million)
HyperLogLog in Redis
# Add unique visitors
PFADD visitors:2024-01-15 "user_42" "user_99" "user_42" "user_7"

# Count unique visitors (user_42 counted once despite 2 adds)
PFCOUNT visitors:2024-01-15
→ 3

# Merge multiple days for weekly count
PFMERGE visitors:week visitors:2024-01-15 visitors:2024-01-16 ... visitors:2024-01-21
PFCOUNT visitors:week
→ ~1,247,832 (±2%)

# Memory used: 12 KB per key, regardless of cardinality
# Exact alternative: SET with all user IDs → 50 MB+ per day

🎯 Interview Insight

When asked "how would you count unique visitors at scale?" — say HyperLogLog. It's the standard answer. 12 KB per counter, ±2% accuracy, built into Redis. Mention that it's mergeable — you can count uniques per server and merge for the global count. This is how analytics systems work at scale.

05

Count-Min Sketch

Count-Min Sketch (CMS) answers: "How many times has this element appeared?" — frequency estimation. It never underestimates the true count but may overestimate it. Perfect for finding heavy hitters (most popular items) in a data stream.

📋

The Tally Board

Imagine tracking which songs are played most on a radio station. You have a small board with limited space — you can't track every song exactly. Instead, you use multiple rows of tally marks, each using a different grouping rule. When a song plays, you add a tally in each row at the position determined by that row's rule. To check a song's count, you look at all rows and take the minimum tally. The minimum is your best estimate — it might be slightly high (other songs share the same positions) but never too low.

Count-Min Sketch — How It Works
Structure: d rows × w columns (d hash functions, w counters per row)
Typical: d=5 rows, w=1000 columns

ADD "query: javascript tutorial" (increment count):
  hash1("javascript tutorial") % 1000 = 42  → row1[42]++
  hash2("javascript tutorial") % 1000 = 718 → row2[718]++
  hash3("javascript tutorial") % 1000 = 255 → row3[255]++
  hash4("javascript tutorial") % 1000 = 891 → row4[891]++
  hash5("javascript tutorial") % 1000 = 103 → row5[103]++

QUERY "javascript tutorial" (estimate count):
  row1[42]=147, row2[718]=152, row3[255]=148, row4[891]=147, row5[103]=149
  Estimate = min(147, 152, 148, 147, 149) = 147

  True count: 145
  Estimate: 147 (overestimate by 2 due to hash collisions)
Never underestimates. May slightly overestimate.
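The ADD/QUERY steps above translate to a small d × w counter matrix. A minimal illustrative sketch (again salting one SHA-256 hash to stand in for d independent hash functions):

```python
import hashlib

class CountMinSketch:
    def __init__(self, d=5, w=1000):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]  # d rows of w counters

    def _cols(self, item):
        # One column per row, derived from a salted hash
        for row in range(self.d):
            h = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.w

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Minimum across rows: never below the true count,
        # possibly above it due to hash collisions
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(item)))

cms = CountMinSketch(d=5, w=1000)
cms.add("javascript tutorial", 145)
print(cms.estimate("javascript tutorial"))  # >= 145, never less
```

Merging sketches from different servers is element-wise addition of the two tables, which is why CMS works well in distributed pipelines.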

Real-World Use Cases

📈

Trending Topics

Track which search queries or hashtags are spiking. The top-K items by estimated frequency are the trending topics. Twitter, Google use this approach.

🚦

Rate Limiting

Track how many requests each IP has made in the last minute. If the estimated count exceeds the limit, block the IP. Memory-efficient for millions of IPs.

🎯

Heavy Hitter Detection

Find the most popular products, most active users, or most frequent error codes in a stream of events. No need to store every event — just the sketch.

Strengths

  • Fixed memory regardless of stream size
  • O(d) update and query time (d = number of hash functions)
  • Never underestimates — safe for rate limiting
  • Mergeable — combine sketches from different servers
  • Handles infinite streams (no need to store raw data)

Limitations

  • Overestimates frequency (hash collisions inflate counts)
  • Cannot list all elements (only query specific ones)
  • Cannot decrement counts (basic version)
  • Accuracy depends on sketch size (more memory = less error)
  • Not suitable when exact counts are required

🎯 Interview Insight

When asked "how would you find trending topics in real-time at scale?" — say Count-Min Sketch. Track frequency of each query/hashtag in a CMS. Periodically extract the top-K items by estimated frequency. Memory-efficient, handles infinite streams, and the overestimation bias is acceptable for trending detection.

06

End-to-End Scenario

Let's design the analytics and protection layer for a search engine processing 100,000 queries per second.

🔍 Search Engine — 100K QPS

Requirements: count unique users, find trending queries, prevent cache penetration.

Constraint: must handle the full stream in real-time with minimal memory.

1

Bloom Filter → Prevent Cache Penetration

Attackers query for nonsense terms that don't exist in the index. Without protection, every query bypasses the cache and hits the search backend. Solution: maintain a Bloom filter of all indexed terms (~100M terms, ~120 MB). Before searching, check the filter. 'Definitely not indexed' → return empty results instantly. Saves millions of backend queries per day.

2

HyperLogLog → Count Unique Users

Dashboard shows 'X unique searchers today.' Exact count would require storing every user ID (~500M users × 8 bytes = 4 GB per day). HyperLogLog: 12 KB per day, ±2% accuracy. PFADD daily_users:{date} {user_id} on every search. PFCOUNT for the dashboard. Merge daily HLLs for weekly/monthly uniques.

3

Count-Min Sketch → Trending Queries

'Trending searches' widget shows the top 20 queries spiking right now. CMS tracks frequency of every query in a 5-minute sliding window. Every 30 seconds, extract the top-K queries by estimated frequency. Memory: ~1 MB for the sketch. Handles 100K QPS without storing any raw query data.
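One subtlety in step 3: a sketch alone cannot enumerate items, so trending extraction pairs the CMS with a candidate map of recently seen queries. A rough sketch of that idea, with illustrative names and sizes (in production the candidate map would be a bounded heap, and the sketch would be reset or decayed per window):

```python
import hashlib
import heapq

def top_k_stream(stream, k=3, d=4, w=512):
    """CMS-style frequency table plus a candidate map for top-K
    extraction. Toy version: the candidate map is unbounded here."""
    table = [[0] * w for _ in range(d)]
    candidates = {}
    for item in stream:
        cols = [int.from_bytes(
                    hashlib.sha256(f"{r}:{item}".encode()).digest()[:8],
                    "big") % w
                for r in range(d)]
        for r, c in enumerate(cols):
            table[r][c] += 1
        # Track the current estimate for every item we have seen
        candidates[item] = min(table[r][c] for r, c in enumerate(cols))
    return heapq.nlargest(k, candidates.items(), key=lambda kv: kv[1])

stream = ["python"] * 50 + ["rust"] * 30 + [f"q{i}" for i in range(20)]
print(top_k_stream(stream, k=2))  # "python" first, then "rust"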

Architecture — All Three Working Together
User searches: "how to learn python"

1. Bloom Filter check:
"how to learn python" in indexed terms? → YES (probably)
Proceed to search backend

2. HyperLogLog update:
PFADD unique_users:2024-01-15 "user_42"
Dashboard: PFCOUNT → ~2,847,291 unique users today (12 KB)

3. Count-Min Sketch update:
Increment frequency of "how to learn python"
Trending check: estimated count = 8,472 in last 5 min
Qualifies for trending widget

Memory budget:
  Bloom filter (100M terms):     ~120 MB
  HyperLogLog (daily uniques):   12 KB
  Count-Min Sketch (trending):   ~1 MB
  Total: ~121 MB

Exact alternatives:
  Full term set:                 ~4 GB
  Full user ID set:              ~4 GB per day
  Full query frequency map:      ~10 GB
  Total: ~18 GB

Savings: 150x less memory for "good enough" answers.
07

Trade-offs & Decision Making

Structure        | Question                 | Error Type      | Memory          | Can Delete? | Can Merge?
Bloom Filter     | Is X in the set?         | False positives | ~1 byte/element | No (basic)  | Yes (OR bit arrays)
HyperLogLog      | How many unique?         | ±2% count       | 12 KB (fixed)   | No          | Yes (PFMERGE)
Count-Min Sketch | How often does X appear? | Overestimates   | Configurable    | No (basic)  | Yes (add matrices)

When to Use Each

Use Case                                  | Structure        | Why
Does this URL exist in our blocklist?     | Bloom Filter     | Membership check, false positive OK (double-check on hit)
How many unique visitors today?           | HyperLogLog      | Cardinality estimation, ±2% is fine for dashboards
What are the top trending hashtags?       | Count-Min Sketch | Frequency estimation, overestimate OK for ranking
Has this user already seen this ad?       | Bloom Filter     | Membership check, showing an ad twice is OK (false positive)
How many distinct error codes this hour?  | HyperLogLog      | Cardinality of error types, approximate is fine
Which IPs are exceeding rate limits?      | Count-Min Sketch | Frequency per IP, overestimate = stricter limiting (safe)

🎯 Decision Framework

"Is X in the set?" → Bloom Filter. "How many unique X?" → HyperLogLog. "How often does X appear?" → Count-Min Sketch. If you need exact answers and can afford the memory → use a HashSet or HashMap. If memory is the constraint → go probabilistic.

08

Interview Questions

Q: What is a Bloom filter and when would you use it?

A: A Bloom filter is a space-efficient probabilistic structure for membership testing. It uses a bit array and multiple hash functions. It can tell you 'definitely not in the set' (no false negatives) or 'probably in the set' (possible false positives). Use it to avoid expensive lookups: before querying the database, check the Bloom filter. If it says 'not in set', skip the DB query entirely. Used in Cassandra (check if an SSTable might contain a key), Chrome (check if a URL is in the malware list), and CDNs (check if content is cached).

Q: Why use HyperLogLog instead of exact counting?

A: Exact counting of unique elements requires storing every element — for 1 billion unique users, that's ~8 GB of memory. HyperLogLog estimates the same count in 12 KB with ±2% accuracy. That's a 700,000x memory reduction. For dashboards, analytics, and monitoring where 'approximately 50 million' is as useful as '49,823,417', HyperLogLog is the standard choice. It's built into Redis (PFADD/PFCOUNT), making it trivial to use.

Q: When would you use a Count-Min Sketch?

A: When you need to track how often items appear in a high-volume stream without storing every event. Examples: (1) Trending topics — track query frequency, extract top-K. (2) Rate limiting — track requests per IP, block when threshold exceeded. (3) Heavy hitter detection — find the most popular products or most active users. CMS never underestimates (safe for rate limiting) and uses fixed memory regardless of stream size. The trade-off: it may overestimate frequency due to hash collisions.

1

Your cache is being hammered by requests for keys that don't exist

How would you protect the database?

Answer: Use a Bloom filter containing all valid keys. Before querying the database on a cache miss, check the Bloom filter. If it says 'definitely not in set' → return 404 immediately without touching the DB. False positives (filter says 'maybe' for a non-existent key) still hit the DB, but the vast majority of invalid requests are blocked. For 100M valid keys, the filter uses ~120 MB and blocks 99%+ of invalid queries. This is the standard solution for cache penetration attacks.

2

You need to show 'X unique listeners' on a music streaming dashboard

The service has 500M daily active users. How do you count uniques efficiently?

Answer: HyperLogLog. For each song, maintain an HLL: PFADD song:{id}:listeners {user_id}. To show unique listeners: PFCOUNT song:{id}:listeners. Memory: 12 KB per song. For 50M songs: 50M × 12 KB = 600 GB — still large. Optimization: only maintain HLLs for songs played in the last 24 hours (~5M songs = 60 GB). For historical data, store the final PFCOUNT result in the database. Merge per-server HLLs for global counts.

09

Common Mistakes

🎯

Expecting exact results

Using a Bloom filter and being surprised by false positives, or using HyperLogLog and expecting an exact count. These structures are probabilistic by design — the error is a feature, not a bug. It's the price of using 1000x less memory.

Understand the error guarantees before choosing. Bloom filter: ~1% false positive rate (configurable). HyperLogLog: ±2% count error. Count-Min Sketch: overestimates by a configurable margin. If these errors are unacceptable for your use case, use exact data structures and pay the memory cost.

🏦

Using probabilistic structures where precision is critical

Using HyperLogLog to count money or Bloom filters to check account balances. Financial data, inventory counts, and security-critical checks require exact answers. A ±2% error on a bank balance is unacceptable.

Probabilistic structures are for analytics, monitoring, and optimization — not for transactional data. Use them for: 'how many unique visitors?' (dashboard), 'is this URL malicious?' (pre-filter, then verify). Never for: 'what is the account balance?' or 'how many items are in stock?'

🔀

Misunderstanding false positives

Thinking a Bloom filter's 'probably yes' means 'definitely yes.' Building critical logic on the positive result without a follow-up check. Example: using a Bloom filter to check if a username is taken, and rejecting the registration on a false positive — the user can never register that name.

Always follow up a Bloom filter positive with an exact check. The filter is a pre-filter: 'definitely no' → skip the expensive check. 'Probably yes' → do the expensive check to confirm. The value is in the 'definitely no' path — that's where you save work.

🔧

Choosing the wrong structure for the use case

Using a Bloom filter when you need frequency counts (use Count-Min Sketch), or using Count-Min Sketch when you need cardinality (use HyperLogLog). Each structure answers a specific question — using the wrong one gives meaningless results.

Match the question to the structure. 'Is X in the set?' → Bloom Filter. 'How many unique X?' → HyperLogLog. 'How often does X appear?' → Count-Min Sketch. If you need to answer multiple questions, use multiple structures — they're cheap enough to combine.