Probabilistic Data Structures
Master Bloom filters, HyperLogLog, and Count-Min Sketch — memory-efficient structures that trade accuracy for performance at massive scale.
The Big Picture — Why Approximate?
At massive scale, exact answers become impossibly expensive. Counting the exact number of unique visitors to a website with 1 billion daily hits requires storing every single visitor ID — gigabytes of memory. But if you only need to know "roughly 50 million unique visitors, ±2%" — you can do it in 12 KB. That's the power of probabilistic data structures.
The Crowd Estimation Analogy
You're at a concert and someone asks 'How many people are here?' You could count every single person — accurate but takes hours. Or you could estimate: count people in one section (200), count the sections (50), multiply: ~10,000 people. You're off by maybe 5%, but you got the answer in 10 seconds instead of 3 hours. Probabilistic data structures work the same way: they give you fast, memory-efficient answers that are 'close enough' for practical purposes.
🔥 Key Insight
Probabilistic data structures don't give wrong answers — they give approximate answers with known error bounds. A Bloom filter might say "maybe yes" when the answer is "no" (false positive), but it will never say "no" when the answer is "yes" (no false negatives). The error is controlled, predictable, and acceptable for the use case.
Overview & Trade-offs
Bloom Filter
Answers: 'Is this element in the set?' Guarantees: no false negatives, possible false positives. Use: membership testing before expensive lookups.
HyperLogLog
Answers: 'How many unique elements are there?' Guarantees: ±2% accuracy in ~12 KB. Use: counting unique users, IPs, queries.
Count-Min Sketch
Answers: 'How often does this element appear?' Guarantees: never underestimates, may overestimate. Use: frequency tracking, trending detection.
| Structure | Question It Answers | Memory | Error Type | Use Case |
|---|---|---|---|---|
| Bloom Filter | Is X in the set? | ~1 byte per element | False positives (says yes when no) | Cache penetration, dedup, spell check |
| HyperLogLog | How many unique elements? | ~12 KB (fixed!) | ±2% count error | Unique visitors, distinct counts |
| Count-Min Sketch | How often does X appear? | Configurable (KB-MB) | Overestimates frequency | Trending topics, rate limiting, heavy hitters |
| Exact Set | Is X in the set? | ~50 bytes per element | None | When precision is required |
| Exact Count | How many unique? | ~50 bytes per element | None | When exact count is required |
💡 When Approximation Is Good Enough
"We had 49,823,417 unique visitors yesterday" vs "We had ~50 million unique visitors yesterday." For a dashboard, both are equally useful. But the exact answer costs gigabytes of memory. The approximate answer costs 12 KB. If the business decision is the same either way, use the approximation.
Bloom Filters
A Bloom filter is a space-efficient structure that answers one question: "Is this element in the set?" It can say "definitely not" or "probably yes." It never gives false negatives — if it says no, the element is guaranteed to not be in the set.
The Library Shortcut
Before walking to the library to check if they have a specific book, you call the front desk. The receptionist checks a quick reference list (not the full catalog). If they say 'We definitely don't have it' — trust them, don't bother going. If they say 'We might have it' — go check the actual shelves. The quick reference list occasionally says 'might have it' for books they don't (false positive), but it NEVER says 'don't have it' for books they do (no false negatives). That's a Bloom filter.
How It Works
Initialize a bit array
Start with an array of m bits, all set to 0. Choose k independent hash functions. Typical: m = 10 bits per element, k = 7 hash functions → ~1% false positive rate.
Add an element
Hash the element with all k hash functions. Each hash produces a position in the bit array. Set those k positions to 1. Example: add('alice') → hash1=3, hash2=7, hash3=12 → set bits 3, 7, 12 to 1.
Check membership
Hash the query with the same k functions. Check if ALL k positions are 1. If any position is 0 → element is DEFINITELY NOT in the set. If all positions are 1 → element is PROBABLY in the set (could be a false positive from other elements setting those bits).
```
Bit array (m=16): [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
Hash functions: k=3

Add "alice": hash1("alice") = 3, hash2("alice") = 7, hash3("alice") = 12
Bit array:   [0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0]

Add "bob":   hash1("bob") = 1, hash2("bob") = 7, hash3("bob") = 14
Bit array:   [0,1,0,1,0,0,0,1,0,0,0,0,1,0,1,0]

Check "alice": positions 3, 7, 12 → all are 1 → "PROBABLY YES" ✅

Check "carol": hash1("carol") = 1, hash2("carol") = 3, hash3("carol") = 9
               positions 1, 3 are 1, but position 9 is 0 → "DEFINITELY NO" ✅

Check "dave":  hash1("dave") = 1, hash2("dave") = 7, hash3("dave") = 14
               positions 1, 7, 14 → all are 1 → "PROBABLY YES"
               But dave was never added! → FALSE POSITIVE ⚠️
               (bits were set by alice and bob coincidentally)
```
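To make the mechanics concrete, here is a minimal Python sketch of a Bloom filter. It is illustrative rather than production code: the class name and parameters are invented for this example, and the k hash functions are derived from a single SHA-256 digest via double hashing, a common trick. Sizing uses the standard formulas m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2, which for n elements at 10 bits each give the k ≈ 7, ~1% false positive configuration described above.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sketch (illustrative, not production code)."""

    def __init__(self, expected_items: int, fp_rate: float = 0.01):
        # Standard sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hashes
        self.m = math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k indexes from two halves of one SHA-256 digest
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False -> definitely not in the set; True -> probably in the set
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(expected_items=1_000_000, fp_rate=0.01)
bf.add("alice")
print(bf.might_contain("alice"))  # True  ("probably yes")
print(bf.might_contain("carol"))  # False ("definitely no"), ~1% chance of True
```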
Strengths
- ✅ Extremely memory efficient (~1 byte per element for 1% FP rate)
- ✅ O(k) lookup time (k hash functions, typically 3-7)
- ✅ No false negatives — 'definitely not' is always correct
- ✅ Perfect for: 'should I bother checking the database?'
- ✅ Used in: Cassandra, Chrome (malicious URL check), CDNs
Limitations
- ❌ False positives possible (says 'maybe yes' when answer is 'no')
- ❌ Cannot delete elements (basic version — counting Bloom filters can)
- ❌ Cannot enumerate elements (you can't list what's in the filter)
- ❌ Size must be chosen upfront (can't grow dynamically)
- ❌ False positive rate increases as the filter fills up
🎯 Real-World Use: Cache Penetration Prevention
Problem: attackers query for keys that don't exist in the cache OR the database. Every request bypasses the cache and hits the DB. Solution: put all valid keys in a Bloom filter. Before querying the DB, check the filter. If it says "definitely not" → return 404 immediately, no DB query. Saves millions of unnecessary database hits.
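In code, the protection pattern looks roughly like this. Everything here is a hypothetical shape: might_contain matches the Bloom filter sketch above, while cache and db_lookup stand in for whatever cache client and database accessor you actually use.

```python
def get_record(key: str, valid_keys_filter, cache, db_lookup):
    """Hypothetical cache-penetration guard: Bloom filter first, DB last."""
    if not valid_keys_filter.might_contain(key):
        return None                 # "definitely not" -> skip cache and DB entirely
    value = cache.get(key)
    if value is None:
        value = db_lookup(key)      # "probably yes" -> confirm with the real source
        if value is not None:
            cache.set(key, value)   # only cache keys that actually exist
    return value
```

The value is in the first branch: every invalid key that the filter rejects costs a few hash computations instead of a database round trip.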
HyperLogLog
HyperLogLog (HLL) answers: "How many unique elements are in this set?" — also called cardinality estimation. It can count billions of unique elements using only ~12 KB of memory, with a typical error of ±2%.
The Coin Flip Intuition
Imagine flipping a coin repeatedly and recording the longest streak of heads. If the longest streak you saw was 3 heads, you probably flipped ~8 times (2³). If the longest streak was 7 heads, you probably flipped ~128 times (2⁷). The longest streak gives you an estimate of how many flips happened. HyperLogLog works similarly: it hashes each element and looks at the longest run of leading zeros in the binary hash. More leading zeros = more unique elements were hashed. It uses multiple 'buckets' (like running the experiment many times) and averages the results for accuracy.
```
For each element:
  1. Hash it to a binary string
  2. Count the leading zeros

hash("alice") = 00010110... → 3 leading zeros
hash("bob")   = 00000011... → 6 leading zeros
hash("carol") = 01001010... → 1 leading zero

Maximum leading zeros seen: 6
Estimated unique count ≈ 2⁶ = 64

With 16,384 buckets (standard HLL):
  Each bucket tracks its own max leading zeros
  Final estimate = harmonic mean across all buckets
  Accuracy: ±0.81% with 12 KB of memory

Comparison:
  Exact unique count of 1 billion elements: ~8 GB (storing all IDs)
  HyperLogLog estimate of 1 billion elements: 12 KB
  Memory savings: ~700,000x
```
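A toy Python version of this idea, under the usual HLL scheme: the first p bits of the hash pick a register, and the remaining bits feed the leading-zero count. The class is illustrative; the bias constant is the standard HLL alpha, and the small-cardinality correction that real implementations apply is omitted for brevity.

```python
import hashlib

class HyperLogLog:
    """Toy HyperLogLog: 2^p registers, each keeping the highest
    'leading zeros + 1' rank seen among the hashes routed to it."""

    def __init__(self, p: int = 14):   # 2^14 = 16,384 registers ≈ 12 KB of 6-bit cells
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        x = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        bucket = x >> (64 - self.p)                # first p bits pick a register
        w = x & ((1 << (64 - self.p)) - 1)         # remaining 64-p bits
        rank = (64 - self.p) - w.bit_length() + 1  # leading zeros in w, plus one
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self) -> float:
        # Raw HLL estimate: bias constant * m^2 / harmonic sum of 2^-register.
        # Small-range correction (linear counting) is omitted for brevity.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        harmonic = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / harmonic

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user_{i}")
print(round(hll.count()))  # ≈ 100,000, typically within ~1%
```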
Strengths
- ✅ Fixed memory: ~12 KB regardless of cardinality
- ✅ Counts billions of unique elements with ±2% error
- ✅ Mergeable: combine HLLs from different servers
- ✅ Built into Redis (PFADD, PFCOUNT, PFMERGE)
- ✅ Used by: Google Analytics, Redis, Presto, Spark
Limitations
- ❌ Approximate — not suitable when exact count is required
- ❌ Cannot list the unique elements (only count them)
- ❌ Cannot remove elements
- ❌ Less accurate for very small cardinalities (<100)
- ❌ Error is relative (±2% of 1 billion = ±20 million)
```
# Add unique visitors
PFADD visitors:2024-01-15 "user_42" "user_99" "user_42" "user_7"

# Count unique visitors (user_42 counted once despite 2 adds)
PFCOUNT visitors:2024-01-15
→ 3

# Merge multiple days for weekly count
PFMERGE visitors:week visitors:2024-01-15 visitors:2024-01-16 ... visitors:2024-01-21
PFCOUNT visitors:week
→ ~1,247,832 (±2%)

# Memory used: 12 KB per key, regardless of cardinality
# Exact alternative: SET with all user IDs → 50 MB+ per day
```
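For reference, the same flow through the redis-py client looks like this (assuming a Redis server is reachable on localhost; the key names match the snippet above):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# user_42 is deduplicated automatically by the HLL
r.pfadd("visitors:2024-01-15", "user_42", "user_99", "user_42", "user_7")
print(r.pfcount("visitors:2024-01-15"))  # 3

# Weekly rollup: merge daily HLLs, then count the merged key
r.pfmerge("visitors:week", "visitors:2024-01-15", "visitors:2024-01-16")
print(r.pfcount("visitors:week"))        # within ±2% of true weekly uniques
```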
🎯 Interview Insight
When asked "how would you count unique visitors at scale?" — say HyperLogLog. It's the standard answer. 12 KB per counter, ±2% accuracy, built into Redis. Mention that it's mergeable — you can count uniques per server and merge for the global count. This is how analytics systems work at scale.
Count-Min Sketch
Count-Min Sketch (CMS) answers: "How many times has this element appeared?" — frequency estimation. It never underestimates the true count but may overestimate it. Perfect for finding heavy hitters (most popular items) in a data stream.
The Tally Board
Imagine tracking which songs are played most on a radio station. You have a small board with limited space — you can't track every song exactly. Instead, you use multiple rows of tally marks, each using a different grouping rule. When a song plays, you add a tally in each row at the position determined by that row's rule. To check a song's count, you look at all rows and take the minimum tally. The minimum is your best estimate — it might be slightly high (other songs share the same positions) but never too low.
```
Structure: d rows × w columns (d hash functions, w counters per row)
Typical: d=5 rows, w=1000 columns

ADD "query: javascript tutorial" (increment count):
  hash1("javascript tutorial") % 1000 = 42  → row1[42]++
  hash2("javascript tutorial") % 1000 = 718 → row2[718]++
  hash3("javascript tutorial") % 1000 = 255 → row3[255]++
  hash4("javascript tutorial") % 1000 = 891 → row4[891]++
  hash5("javascript tutorial") % 1000 = 103 → row5[103]++

QUERY "javascript tutorial" (estimate count):
  row1[42]=147, row2[718]=152, row3[255]=148, row4[891]=147, row5[103]=149
  Estimate = min(147, 152, 148, 147, 149) = 147

True count: 145
Estimate:   147 (overestimate by 2 due to hash collisions)
→ Never underestimates. May slightly overestimate.
```
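Here is a minimal Python sketch of the structure above. The class and its parameters are illustrative; the d "independent" hash functions are simulated by salting a single SHA-256 hash with the row number.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: d rows of w counters; estimate = min over rows."""

    def __init__(self, width: int = 1000, depth: int = 5):
        self.w, self.d = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Salt the hash with the row number to simulate d independent hashes
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item: str, count: int = 1) -> None:
        for r in range(self.d):
            self.rows[r][self._index(item, r)] += count

    def estimate(self, item: str) -> int:
        # Minimum across rows: never below the true count, possibly above
        return min(self.rows[r][self._index(item, r)] for r in range(self.d))

cms = CountMinSketch()
for _ in range(145):
    cms.add("javascript tutorial")
print(cms.estimate("javascript tutorial"))  # >= 145; higher only on collisions
```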
Real-World Use Cases
Trending Topics
Track which search queries or hashtags are spiking. The top-K items by estimated frequency are the trending topics. Twitter and Google use this approach.
Rate Limiting
Track how many requests each IP has made in the last minute. If the estimated count exceeds the limit, block the IP. Memory-efficient for millions of IPs (a minimal sketch follows after this list).
Heavy Hitter Detection
Find the most popular products, most active users, or most frequent error codes in a stream of events. No need to store every event — just the sketch.
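As promised above, here is a minimal rate-limiter sketch built on the CountMinSketch class from the previous snippet. The fixed-window scheme and class name are invented for this example; note that CMS overestimation only makes the limiter stricter, never more permissive.

```python
import time

class FixedWindowRateLimiter:
    """Hypothetical rate limiter reusing the CountMinSketch sketch above.
    One sketch per fixed time window; expired windows are discarded."""

    def __init__(self, limit_per_window: int, window_seconds: int = 60):
        self.limit = limit_per_window
        self.window = window_seconds
        self.sketches = {}  # window start timestamp -> CountMinSketch

    def allow(self, ip: str) -> bool:
        now = int(time.time())
        bucket = now - now % self.window
        # Drop expired windows so memory stays bounded
        for old in [b for b in self.sketches if b < bucket]:
            del self.sketches[old]
        sketch = self.sketches.setdefault(bucket, CountMinSketch())
        sketch.add(ip)
        # CMS never underestimates, so a limit breach is never missed
        return sketch.estimate(ip) <= self.limit

limiter = FixedWindowRateLimiter(limit_per_window=100)
print(limiter.allow("203.0.113.7"))  # True until this IP exceeds 100/minute
```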
Strengths
- ✅ Fixed memory regardless of stream size
- ✅ O(d) update and query time (d = number of hash functions)
- ✅ Never underestimates — safe for rate limiting
- ✅ Mergeable — combine sketches from different servers
- ✅ Handles infinite streams (no need to store raw data)
Limitations
- ❌ Overestimates frequency (hash collisions inflate counts)
- ❌ Cannot list all elements (only query specific ones)
- ❌ Cannot decrement counts (basic version)
- ❌ Accuracy depends on sketch size (more memory = less error)
- ❌ Not suitable when exact counts are required
🎯 Interview Insight
When asked "how would you find trending topics in real-time at scale?" — say Count-Min Sketch. Track frequency of each query/hashtag in a CMS. Periodically extract the top-K items by estimated frequency. Memory-efficient, handles infinite streams, and the overestimation bias is acceptable for trending detection.
End-to-End Scenario
Let's design the analytics and protection layer for a search engine processing 100,000 queries per second.
🔍 Search Engine — 100K QPS
Requirements: count unique users, find trending queries, prevent cache penetration.
Constraint: must handle the full stream in real-time with minimal memory.
Bloom Filter → Prevent Cache Penetration
Attackers query for nonsense terms that don't exist in the index. Without protection, every query bypasses the cache and hits the search backend. Solution: maintain a Bloom filter of all indexed terms (~100M terms, ~120 MB). Before searching, check the filter. 'Definitely not indexed' → return empty results instantly. Saves millions of backend queries per day.
HyperLogLog → Count Unique Users
Dashboard shows 'X unique searchers today.' Exact count would require storing every user ID (~500M users × 8 bytes = 4 GB per day). HyperLogLog: 12 KB per day, ±2% accuracy. PFADD daily_users:{date} {user_id} on every search. PFCOUNT for the dashboard. Merge daily HLLs for weekly/monthly uniques.
Count-Min Sketch → Trending Queries
'Trending searches' widget shows the top 20 queries spiking right now. CMS tracks frequency of every query in a 5-minute sliding window. Every 30 seconds, extract the top-K queries by estimated frequency. Memory: ~1 MB for the sketch. Handles 100K QPS without storing any raw query data.
```
User searches: "how to learn python"

1. Bloom filter check:
   → "how to learn python" in indexed terms? → YES (probably)
   → Proceed to search backend

2. HyperLogLog update:
   → PFADD unique_users:2024-01-15 "user_42"
   → Dashboard: PFCOUNT → ~2,847,291 unique users today (12 KB)

3. Count-Min Sketch update:
   → Increment frequency of "how to learn python"
   → Trending check: estimated count = 8,472 in last 5 min
   → Qualifies for trending widget

Memory budget:
  Bloom filter (100M terms):   ~120 MB
  HyperLogLog (daily uniques): 12 KB
  Count-Min Sketch (trending): ~1 MB
  Total:                       ~121 MB

Exact alternatives:
  Full term set:            ~4 GB
  Full user ID set:         ~4 GB per day
  Full query frequency map: ~10 GB
  Total:                    ~18 GB

Savings: 150x less memory for "good enough" answers.
```
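Tying it together, here is a hypothetical wiring of the three sketches defined earlier in this chapter. The class, sizes, and backend_search hook are all illustrative; in production each structure would typically live in shared infrastructure (for example Redis) rather than in-process.

```python
# Assumes the BloomFilter, HyperLogLog, and CountMinSketch sketches above.
class SearchAnalyticsLayer:
    def __init__(self):
        # ~120 MB of bits for 100M indexed terms at a 1% false positive rate
        self.indexed_terms = BloomFilter(expected_items=100_000_000, fp_rate=0.01)
        self.daily_uniques = HyperLogLog()                          # ~12 KB
        self.query_counts = CountMinSketch(width=100_000, depth=5)  # counters only

    def handle_query(self, user_id: str, query: str, backend_search):
        # 1. Penetration guard: "definitely not indexed" answers instantly
        if not self.indexed_terms.might_contain(query):
            return []
        # 2. Unique-searcher counter for the dashboard
        self.daily_uniques.add(user_id)
        # 3. Frequency sketch feeding the trending widget
        self.query_counts.add(query)
        return backend_search(query)
```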
Trade-offs & Decision Making
| Structure | Question | Error Type | Memory | Can Delete? | Can Merge? |
|---|---|---|---|---|---|
| Bloom Filter | Is X in the set? | False positives | ~1 byte/element | No (basic) | Yes (OR bit arrays) |
| HyperLogLog | How many unique? | ±2% count | 12 KB (fixed) | No | Yes (PFMERGE) |
| Count-Min Sketch | How often does X appear? | Overestimates | Configurable | No (basic) | Yes (add matrices) |
When to Use Each
| Use Case | Structure | Why |
|---|---|---|
| Does this URL exist in our blocklist? | Bloom Filter | Membership check, false positive OK (double-check on hit) |
| How many unique visitors today? | HyperLogLog | Cardinality estimation, ±2% is fine for dashboards |
| What are the top trending hashtags? | Count-Min Sketch | Frequency estimation, overestimate OK for ranking |
| Has this user already seen this ad? | Bloom Filter | Membership check, showing an ad twice is OK (false positive) |
| How many distinct error codes this hour? | HyperLogLog | Cardinality of error types, approximate is fine |
| Which IPs are exceeding rate limits? | Count-Min Sketch | Frequency per IP, overestimate = stricter limiting (safe) |
🎯 Decision Framework
"Is X in the set?" → Bloom Filter. "How many unique X?" → HyperLogLog. "How often does X appear?" → Count-Min Sketch. If you need exact answers and can afford the memory → use a HashSet or HashMap. If memory is the constraint → go probabilistic.
Interview Questions
Q: What is a Bloom filter and when would you use it?
A: A Bloom filter is a space-efficient probabilistic structure for membership testing. It uses a bit array and multiple hash functions. It can tell you 'definitely not in the set' (no false negatives) or 'probably in the set' (possible false positives). Use it to avoid expensive lookups: before querying the database, check the Bloom filter. If it says 'not in set', skip the DB query entirely. Used in Cassandra (check if an SSTable might contain a key), Chrome (check if a URL is in the malware list), and CDNs (check if content is cached).
Q: Why use HyperLogLog instead of exact counting?
A: Exact counting of unique elements requires storing every element — for 1 billion unique users, that's ~8 GB of memory. HyperLogLog estimates the same count in 12 KB with ±2% accuracy. That's a 700,000x memory reduction. For dashboards, analytics, and monitoring where 'approximately 50 million' is as useful as '49,823,417', HyperLogLog is the standard choice. It's built into Redis (PFADD/PFCOUNT), making it trivial to use.
Q: When would you use a Count-Min Sketch?
A: When you need to track how often items appear in a high-volume stream without storing every event. Examples: (1) Trending topics — track query frequency, extract top-K. (2) Rate limiting — track requests per IP, block when threshold exceeded. (3) Heavy hitter detection — find the most popular products or most active users. CMS never underestimates (safe for rate limiting) and uses fixed memory regardless of stream size. The trade-off: it may overestimate frequency due to hash collisions.
Your cache is being hammered by requests for keys that don't exist
How would you protect the database?
Answer: Use a Bloom filter containing all valid keys. Before querying the database on a cache miss, check the Bloom filter. If it says 'definitely not in set' → return 404 immediately without touching the DB. False positives (filter says 'maybe' for a non-existent key) still hit the DB, but the vast majority of invalid requests are blocked. For 100M valid keys, the filter uses ~120 MB and blocks 99%+ of invalid queries. This is the standard solution for cache penetration attacks.
You need to show 'X unique listeners' on a music streaming dashboard
The service has 500M daily active users. How do you count uniques efficiently?
Answer: HyperLogLog. For each song, maintain an HLL: PFADD song:{id}:listeners {user_id}. To show unique listeners: PFCOUNT song:{id}:listeners. Memory: 12 KB per song. For 50M songs: 50M × 12 KB = 600 GB — still large. Optimization: only maintain HLLs for songs played in the last 24 hours (~5M songs = 60 GB). For historical data, store the final PFCOUNT result in the database. Merge per-server HLLs for global counts.
Common Mistakes
Expecting exact results
Using a Bloom filter and being surprised by false positives, or using HyperLogLog and expecting an exact count. These structures are probabilistic by design — the error is a feature, not a bug. It's the price of using 1000x less memory.
✅Understand the error guarantees before choosing. Bloom filter: ~1% false positive rate (configurable). HyperLogLog: ±2% count error. Count-Min Sketch: overestimates by a configurable margin. If these errors are unacceptable for your use case, use exact data structures and pay the memory cost.
Using probabilistic structures where precision is critical
Using HyperLogLog to count money or Bloom filters to check account balances. Financial data, inventory counts, and security-critical checks require exact answers. A ±2% error on a bank balance is unacceptable.
✅Probabilistic structures are for analytics, monitoring, and optimization — not for transactional data. Use them for: 'how many unique visitors?' (dashboard), 'is this URL malicious?' (pre-filter, then verify). Never for: 'what is the account balance?' or 'how many items are in stock?'
Misunderstanding false positives
Thinking a Bloom filter's 'probably yes' means 'definitely yes.' Building critical logic on the positive result without a follow-up check. Example: using a Bloom filter to check if a username is taken, and rejecting the registration on a false positive — the user can never register that name.
✅Always follow up a Bloom filter positive with an exact check. The filter is a pre-filter: 'definitely no' → skip the expensive check. 'Probably yes' → do the expensive check to confirm. The value is in the 'definitely no' path — that's where you save work.
Choosing the wrong structure for the use case
Using a Bloom filter when you need frequency counts (use Count-Min Sketch), or using Count-Min Sketch when you need cardinality (use HyperLogLog). Each structure answers a specific question — using the wrong one gives meaningless results.
✅Match the question to the structure. 'Is X in the set?' → Bloom Filter. 'How many unique X?' → HyperLogLog. 'How often does X appear?' → Count-Min Sketch. If you need to answer multiple questions, use multiple structures — they're cheap enough to combine.