Consistency & Consensus
Deep dive into distributed systems theory — CAP theorem, PACELC model, strong vs eventual consistency, and linearizability. Understand how systems behave under failure.
The Big Picture — Why Distributed Systems Are Hard
A single database on a single machine is simple. Every read sees the latest write. There's one copy of the data, one source of truth. But a single machine can fail, can't handle millions of requests, and is in one geographic location. So we distribute — we replicate data across multiple machines in multiple locations. And that's where everything gets hard.
The Bank Branches Analogy
Imagine a bank with 3 branches in different cities. Each branch has a copy of your account balance. You deposit $500 at the Mumbai branch. At the exact same moment, your spouse withdraws $300 at the Delhi branch. Both branches have the old balance ($1,000). Mumbai updates to $1,500. Delhi updates to $700. Which one is correct? Neither — the correct balance is $1,200. This is the fundamental problem of distributed systems: when multiple copies of data exist, and updates happen at different places at the same time, how do you keep everything consistent?
The core tension: we replicate data for reliability and performance, but replication introduces the possibility of conflicting updates, stale reads, and disagreement between nodes. Consistency and consensus are the tools we use to manage this tension.
🔥 Key Insight
Distributed systems are hard not because the algorithms are complex, but because the physical world is unreliable. Networks drop packets. Machines crash. Clocks drift. Any solution must work correctly despite these failures — and that's what makes it fundamentally different from single-machine programming.
Consistency & Consensus Overview
Consistency
All nodes see the same data at the same time. When you write a value, every subsequent read (from any node) returns that value. The question: how strict is 'same time'?
Consensus
All nodes agree on a single value or decision. Even if some nodes fail or messages are delayed, the remaining nodes reach agreement. The mechanism that enables consistency.
Why We Replicate Data
Fault Tolerance
If one machine dies, the data still exists on other machines. No single point of failure. A 3-replica system still has a copy of the data even after two machine failures.
Read Performance
Multiple replicas can serve reads in parallel. A user in Tokyo reads from a Tokyo replica instead of a Virginia server. Lower latency, higher throughput.
Geographic Distribution
Data close to users worldwide. A global service needs replicas in multiple regions so users don't wait for cross-continent round trips.
Problems Introduced by Replication
⚔️ Conflicts
Two users update the same data on different replicas simultaneously. Which update wins? How do you merge them?
⏳ Delays
Replication takes time. A write to Node A hasn't reached Node B yet. A read from Node B returns stale data.
💥 Failures
A network partition splits nodes into groups that can't communicate. Each group might accept conflicting writes.
💡 The Fundamental Trade-off
You can't have it all. Stronger consistency requires more coordination between nodes, which means higher latency and lower availability. Weaker consistency allows faster responses and higher availability, but clients may see stale or conflicting data. Every distributed system sits somewhere on this spectrum.
CAP Theorem — Deep Dive
The CAP theorem states that a distributed system can provide at most two of three guarantees simultaneously: Consistency, Availability, and Partition Tolerance. Since network partitions are inevitable in any distributed system, the real choice is between Consistency and Availability during a partition.
Consistency (C)
Every read receives the most recent write or an error. All nodes see the same data at the same time. If you write 'balance = $500', every node immediately returns $500.
Availability (A)
Every request receives a response (not an error), even if it might not contain the most recent write. The system always responds, even if the data might be stale.
Partition Tolerance (P)
The system continues to operate despite network partitions — when some nodes can't communicate with others. Messages between nodes are lost or delayed.
Bank Branches During a Storm
Your bank has branches in Mumbai and Delhi. A storm cuts the network between them (a partition). A customer in Mumbai deposits $500. Now the bank has a choice:
- CP (Consistency + Partition Tolerance): the Delhi branch refuses to serve any requests until it can sync with Mumbai. "Sorry, system is down." Customers are frustrated, but no one sees wrong data.
- AP (Availability + Partition Tolerance): the Delhi branch keeps serving requests with the last known balance. Customers get responses, but the balance might be stale. When the network recovers, the branches reconcile.

You can't have CA during a partition — that would mean both branches show the correct balance AND respond to every request, which is impossible when they can't communicate.
Why Partition Tolerance Is Non-Negotiable
In any real distributed system, network partitions will happen. Cables get cut. Routers fail. Cloud availability zones lose connectivity. You can't choose "no partitions" — you can only choose how to behave when they occur. This means the real choice is always CP or AP.
| System Type | During Partition | Trade-off | Examples |
|---|---|---|---|
| CP (Consistent + Partition Tolerant) | Rejects requests until partition heals | Sacrifices availability for correctness | HBase, MongoDB (strong read), Zookeeper, etcd |
| AP (Available + Partition Tolerant) | Serves requests with potentially stale data | Sacrifices consistency for availability | Cassandra, DynamoDB, CouchDB, DNS |
Network partition splits the cluster: [Node A, Node B] ←— can't talk to —→ [Node C]. A client writes "balance = $500" to Node A. Nodes A and B are updated; Node C still has the old value ($1,000).

CP system (e.g., HBase):
→ Node C stops accepting reads and writes
→ Returns an error: "Service unavailable"
→ When the partition heals, Node C syncs and becomes available again
→ Guarantee: no client ever sees stale data

AP system (e.g., Cassandra):
→ Node C keeps accepting reads and writes
→ Returns the old value: "balance = $1,000" (stale!)
→ When the partition heals, conflict resolution merges the data
→ Guarantee: every client always gets a response
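To make the two behaviors concrete, here is a minimal toy sketch in plain Python (no real networking; the node, mode, and balance values are all illustrative) of how a CP replica and an AP replica might answer reads during a partition:

```python
class Unavailable(Exception):
    """Raised by a CP replica that refuses to serve during a partition."""

class Replica:
    def __init__(self, mode: str, value: int):
        self.mode = mode          # "CP" or "AP"
        self.value = value        # this replica's local copy of the data
        self.partitioned = False  # True = cut off from the other nodes

    def read(self) -> int:
        if self.partitioned and self.mode == "CP":
            # CP: better to fail than to risk returning stale data
            raise Unavailable("service unavailable until partition heals")
        # AP: always answer, even though the local copy may be stale
        return self.value

# Node C is cut off and never saw the write that updated the balance.
node_c = Replica(mode="AP", value=1000)
node_c.partitioned = True
print(node_c.read())       # -> 1000 (stale, but available)

node_c.mode = "CP"
try:
    node_c.read()
except Unavailable as err:
    print(err)             # -> consistent, but not available
```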
🎯 Interview Insight
Don't say "I'd choose CA." In a distributed system, partitions are inevitable, so CA is not a real option. The correct framing: "During a partition, I'd choose CP for financial data (correctness matters more) and AP for social feeds (availability matters more)." Different parts of the same system can make different choices.
PACELC Model
CAP only describes behavior during partitions. But what about normal operation — when there's no partition? That's where PACELC comes in. It extends CAP to cover the full picture.
If there is a Partition (P): choose between Availability (A) and Consistency (C).
Else (E) — normal operation, no partition: choose between Latency (L) and Consistency (C).

Full notation: PA/EL, PC/EC, PA/EC, PC/EL.

Examples:
- Cassandra → PA/EL (available during a partition, low latency normally)
- DynamoDB → PA/EL (the same — availability and speed first)
- HBase → PC/EC (consistent during a partition, consistent normally)
- MongoDB → PC/EC (strong consistency by default)
- Cosmos DB → configurable (you choose per query!)
Why PACELC Is More Realistic Than CAP
CAP only matters during partitions — which are rare (minutes per year). PACELC captures the trade-off you face every single request: do you wait for all replicas to confirm (consistency, higher latency) or respond from the nearest replica immediately (low latency, possibly stale)?
⚡ EL — Else Latency (Speed First)
- Read from the nearest replica immediately
- Don't wait for all replicas to agree
- Response in 1-5ms (local replica)
- Risk: data might be slightly stale
- Use case: social feeds, product catalogs, caching
🔒 EC — Else Consistency (Correctness First)
- Wait for majority of replicas to confirm
- Every read sees the latest write
- Response in 10-100ms (coordination overhead)
- Guarantee: no stale reads
- Use case: banking, inventory, bookings
| Database | During Partition (P) | Normal Operation (E) | PACELC |
|---|---|---|---|
| Cassandra | Available (A) | Low Latency (L) | PA/EL |
| DynamoDB | Available (A) | Low Latency (L) | PA/EL |
| MongoDB | Consistent (C) | Consistent (C) | PC/EC |
| HBase | Consistent (C) | Consistent (C) | PC/EC |
| Cosmos DB | Configurable | Configurable | Tunable |
| PostgreSQL (single) | N/A (not distributed) | Consistent (C) | EC |
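Some databases expose this trade-off per request rather than per system. Here's a sketch of per-query consistency levels using the DataStax Python driver for Cassandra, assuming a reachable local cluster and a hypothetical `app` keyspace with `profiles` and `accounts` tables:

```python
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])   # assumed local cluster
session = cluster.connect("app")   # hypothetical keyspace

# EL-style read: answer from a single replica. Fastest, possibly stale.
feed_read = SimpleStatement(
    "SELECT bio FROM profiles WHERE user_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)

# EC-style read: wait for a majority of replicas. Slower, but no stale
# reads when writes also use QUORUM (because R + W > N).
balance_read = SimpleStatement(
    "SELECT balance FROM accounts WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

profile = session.execute(feed_read, ["alice"]).one()
balance = session.execute(balance_read, ["alice"]).one()
```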
🎯 Interview Insight
When discussing database choices, use PACELC instead of just CAP. Say: "Cassandra is PA/EL — it prioritizes availability during partitions and low latency during normal operation. That makes it ideal for our social feed where stale data is acceptable but downtime is not." This shows deeper understanding than just saying "Cassandra is AP."
Strong vs Eventual Consistency
This is the most practical trade-off you'll face in system design. Strong consistency means every read returns the latest write. Eventual consistency means reads might return stale data, but all replicas will converge over time.
The Shared Document Analogy
Strong consistency is like Google Docs with real-time sync — every collaborator sees every keystroke instantly. If Alice types 'Hello', Bob sees 'Hello' immediately. This requires constant communication between all participants. Eventual consistency is like emailing a document back and forth. Alice edits her copy, Bob edits his copy. Eventually they merge changes. For a while, they have different versions. But they'll converge. Faster to work independently, but you might have conflicts.
🔒 Strong Consistency
- Every read returns the most recent write
- All nodes agree before acknowledging a write
- Higher latency (coordination overhead)
- Lower availability during partitions
- Simpler application logic (no stale data handling)
⏳ Eventual Consistency
- Reads might return stale data temporarily
- Write acknowledged after one replica confirms
- Lower latency (no coordination needed)
- Higher availability (always responds)
- Complex application logic (handle stale reads, conflicts)
Scenario: a user updates their profile name from "Alice" to "Alicia".

Strong consistency:
1. Write "Alicia" → wait for all replicas to confirm
2. Any subsequent read from ANY server → "Alicia"
3. User refreshes the page → sees "Alicia" immediately
4. Latency: 50-200ms (coordination across replicas)

Eventual consistency:
1. Write "Alicia" → acknowledged after 1 replica confirms
2. Read from the same server → "Alicia" ✅
3. Read from a different server → might still show "Alice" ⏳
4. After 100ms-2s, all replicas converge → "Alicia" everywhere
5. Latency: 5-20ms (no coordination)

User experience impact:
→ User updates their name, refreshes, briefly sees the old name
→ "Did my update not save?" → confusion
→ Fix: read-your-own-writes consistency (route the user's reads to the replica they wrote to)
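Here's a toy sketch of the two write paths (pure Python, synchronous for clarity; the replica count, names, and the background-sync step are all illustrative):

```python
class ReplicaSet:
    def __init__(self, n: int):
        self.replicas = [{"name": "Alice"} for _ in range(n)]

    def write_strong(self, key, value):
        # Strong: apply to every replica before acknowledging.
        for r in self.replicas:
            r[key] = value       # in reality, a network round trip each
        return "ack"             # all replicas now agree

    def write_eventual(self, key, value):
        # Eventual: apply locally, acknowledge, replicate later.
        self.replicas[0][key] = value
        return "ack"             # replicas 1..n-1 are briefly stale

    def replicate_async(self, key):
        # The background sync that eventually converges all replicas.
        latest = self.replicas[0][key]
        for r in self.replicas[1:]:
            r[key] = latest

rs = ReplicaSet(3)
rs.write_eventual("name", "Alicia")
print(rs.replicas[1]["name"])   # -> "Alice" (a stale read!)
rs.replicate_async("name")
print(rs.replicas[1]["name"])   # -> "Alicia" (converged)
```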
When Is Eventual Consistency Acceptable?
| Data Type | Consistency Needed | Why |
|---|---|---|
| Bank balance | Strong | Showing wrong balance → overdraft, fraud, legal issues |
| Inventory count | Strong | Stale count → overselling, angry customers |
| Social media likes | Eventual | Showing 1,023 instead of 1,024 likes → nobody notices |
| News feed | Eventual | Seeing a post 2 seconds late → acceptable |
| Shopping cart | Session | User must see their own updates; other users don't matter |
| Leaderboard | Eventual | Scores updating with a few seconds delay → fine |
| Flight booking | Strong | Double-booking a seat → catastrophic |
⚠️ Stale Reads — The Hidden UX Problem
The most common user complaint with eventual consistency: "I just updated my profile but it still shows the old name." Fix: use "read-your-own-writes" consistency — route the user's reads to the same replica they wrote to. Other users can see eventual consistency; the writing user sees strong consistency for their own data.
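A minimal sketch of that fix (pure Python; the per-replica "watermark", a count of how many writes the replica has applied, and all class names are illustrative):

```python
class Node:
    def __init__(self):
        self.data, self.watermark = {}, 0

    def apply(self, key, value):
        self.data[key] = value
        self.watermark += 1      # how many writes this node has applied
        return self.watermark

    def get(self, key):
        return self.data.get(key)

class Router:
    """Route each user's reads to a replica that has seen their last write."""

    def __init__(self, leader: Node, followers: list):
        self.leader = leader        # always up to date
        self.followers = followers  # may lag behind the leader
        self.last_write = {}        # user_id -> seq number of their last write

    def write(self, user_id, key, value):
        self.last_write[user_id] = self.leader.apply(key, value)

    def read(self, user_id, key):
        need = self.last_write.get(user_id, 0)
        for f in self.followers:
            if f.watermark >= need:   # caught up far enough for this user
                return f.get(key)
        return self.leader.get(key)   # fall back to the leader

leader, follower = Node(), Node()
router = Router(leader, [follower])
router.write("alice", "name", "Alicia")
print(router.read("alice", "name"))  # -> "Alicia" (routed to the leader)
print(router.read("bob", "name"))    # -> None (stale follower; fine for bob)
```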
Linearizability
Linearizability is the strongest form of consistency. It guarantees that all operations appear to execute in a single, global order — as if there's only one copy of the data, even though it's replicated across multiple nodes.
The Billing Counter Queue
Imagine a single billing counter at a store. Customers are served in the exact order they arrive. If Alice arrives before Bob, Alice is served first — always. There's no ambiguity about the order. Linearizability is this queue applied to a distributed system. Even though there are multiple replicas (multiple counters), operations appear as if they went through a single counter in a strict order. If operation A completes before operation B starts, then every observer sees A's effect before B's.
Linearizability vs Eventual Consistency
Timeline:
- T1: Client A writes x = 1
- T2: Client A gets an acknowledgment (write complete)
- T3: Client B reads x

Linearizable: Client B's read MUST return 1. Because the write completed (T2) before the read started (T3), the read must see the write's effect. No exceptions.

Eventually consistent: Client B's read might return 0 (the old value). The write happened, but the replica Client B is reading from hasn't received the update yet. It will... eventually.

Linearizability = real-time ordering guarantee. Eventual consistency = no ordering guarantee (just convergence).
When linearizability is required
- ✅ Leader election — all nodes must agree on who the leader is
- ✅ Distributed locks — only one process holds the lock at a time
- ✅ Unique constraints — no two users can register the same username
- ✅ Financial transactions — account balance must be globally consistent
- ✅ Coordination services — Zookeeper and etcd rely on linearizability
Why linearizability is expensive
- ❌ Requires coordination on every operation (a consensus protocol)
- ❌ Latency increases with geographic distance between replicas
- ❌ Throughput is limited by the slowest replica in the quorum
- ❌ Availability drops during partitions (the CP trade-off)
- ❌ Most applications don't actually need it for most data
🎯 Interview Insight
Don't say "we need linearizability everywhere." It's expensive and rarely needed for all data. Say: "We need linearizability for the lock service and unique username checks, but eventual consistency is fine for the news feed and like counts." This shows you understand the cost and apply it surgically.
End-to-End Scenario
Let's design the consistency model for two real systems — a banking platform and a social media platform — and see how the same concepts lead to completely different decisions.
🏦 Banking Platform
CAP choice: CP — during a partition, reject transactions rather than risk incorrect balances. A user seeing "service unavailable" is better than seeing a wrong balance and overdrawing.
PACELC: PC/EC — even during normal operation, we choose consistency over latency. A transfer taking 200ms instead of 20ms is acceptable. A wrong balance is not.
Consistency model: Strong consistency (linearizable for balance updates). Every read must return the latest balance. Use a consensus protocol (Raft/Paxos) for the transaction log.
User experience: Slightly higher latency (50-200ms per transaction). Occasional "service unavailable" during rare partitions. But the balance is always correct. Users trust the system.
Database: PostgreSQL with synchronous replication, or CockroachDB (distributed SQL with strong consistency).
📱 Social Media Platform
CAP choice: AP — during a partition, keep serving feeds and accepting posts. A user seeing a slightly stale feed is better than seeing "service unavailable." Social media must always be available.
PACELC: PA/EL — even during normal operation, we choose latency over consistency. The feed loading in 20ms with slightly stale data is better than 200ms with perfectly fresh data.
Consistency model: Eventual consistency for feeds, likes, and comments. Read-your-own-writes for the posting user (they should see their own post immediately). Strong consistency only for account operations (login, password change).
User experience: Ultra-fast feed loading. A new post might take 1-2 seconds to appear for other users. Like counts might be briefly inaccurate. Nobody notices or cares.
Database: Cassandra for feeds and activity data (PA/EL). PostgreSQL for user accounts (PC/EC). Redis for real-time counters (eventual, fast).
🔥 The Key Takeaway
The same system often uses different consistency models for different data. A social media platform uses strong consistency for passwords and eventual consistency for likes. A banking platform uses strong consistency for balances but might use eventual consistency for transaction history display. Match the consistency model to the data's requirements, not to the system as a whole.
Trade-offs & Decision Making
Consistency vs Availability
| Dimension | Choose Consistency | Choose Availability |
|---|---|---|
| During partition | Reject requests (errors) | Serve stale data |
| User experience | Occasional downtime | Always responsive |
| Data correctness | Always correct | Temporarily stale |
| Use when | Wrong data causes real harm | Downtime causes real harm |
| Examples | Banking, inventory, bookings | Social media, CDN, DNS |
Latency vs Correctness
| Dimension | Choose Low Latency | Choose Correctness |
|---|---|---|
| Read path | Read from nearest replica (5ms) | Read from leader or quorum (50-200ms) |
| Write path | Acknowledge after 1 replica (5ms) | Acknowledge after majority (50-200ms) |
| Stale reads | Possible (ms to seconds) | Never |
| Throughput | Higher (no coordination) | Lower (coordination overhead) |
| Use when | Speed > accuracy (feeds, caching) | Accuracy > speed (finance, inventory) |
Complexity vs Guarantees
| Dimension | Simple (Eventual) | Complex (Strong/Linearizable) |
|---|---|---|
| Implementation | Write to one replica, async replication | Consensus protocol (Raft, Paxos) |
| Failure handling | Conflict resolution after the fact | Prevent conflicts before they happen |
| Application code | Must handle stale reads, conflicts | Simpler (data is always correct) |
| Operational cost | Lower (fewer coordination failures) | Higher (consensus can stall) |
| Debugging | Harder (non-deterministic behavior) | Easier (deterministic ordering) |
🎯 Decision Framework
For each piece of data, ask: "What's the cost of a stale read?" If the answer is "user sees 1,023 likes instead of 1,024" → eventual consistency. If the answer is "user sees $1,000 instead of $500 and withdraws money they don't have" → strong consistency. The cost of inconsistency determines the consistency model.
Interview Questions
Conceptual, scenario-based, and trick questions you're likely to encounter.
Q: Can you achieve CA (Consistency + Availability) in a distributed system?
A: Not during a network partition. The CAP theorem proves that when nodes can't communicate (partition), you must choose: respond with potentially stale data (A) or refuse to respond until the partition heals (C). However, when there's no partition (which is most of the time), you CAN have both consistency and availability — that's the 'E' in PACELC. A single-node database (non-distributed) is effectively CA, but it's not partition-tolerant because it's a single point of failure.
Q: What is eventual consistency?
A: A consistency model where, after a write, not all replicas are updated immediately. For a brief period, different replicas may return different values. If no new writes occur, all replicas will eventually converge to the same value. 'Eventually' is typically milliseconds to seconds, not hours. The key insight: eventual consistency is not 'no consistency' — it's a guarantee that the system WILL converge. It's a deliberate trade-off for lower latency and higher availability.
Q: When would you choose AP over CP?
A: When availability is more valuable than perfect consistency. Examples: (1) Social media feeds — users would rather see a slightly stale feed than a 'service unavailable' page. (2) DNS — returning a cached (possibly stale) IP is better than failing to resolve. (3) Shopping cart — losing a cart update is bad, but the entire site being down is worse. (4) CDN — serving a cached page is better than no page. The common thread: the cost of brief inconsistency is lower than the cost of downtime.
You're designing a distributed counter for video view counts
What consistency model would you use?
Answer: Eventual consistency. View counts don't need to be perfectly accurate in real-time. If the count shows 1,000,042 instead of 1,000,045 for a few seconds, nobody notices. Use a PA/EL system (Cassandra or Redis) where each node increments its local counter and periodically syncs. This gives you ultra-low latency writes (critical for high-traffic videos) and high availability. The count converges within seconds.
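One common construction for such a counter is a grow-only counter CRDT (a G-counter): each node increments only its own slot, and merging takes the element-wise max, so syncs can arrive in any order and replicas still converge. A minimal sketch with made-up node IDs:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merge = element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {node_id: 0}

    def increment(self, n: int = 1):
        # Local, no coordination: the ultra-low-latency PA/EL write path.
        self.counts[self.node_id] += n

    def merge(self, other: "GCounter"):
        # Periodic sync. Max per slot is commutative, associative, and
        # idempotent, so replicas converge regardless of sync order.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())

tokyo, virginia = GCounter("tokyo"), GCounter("virginia")
tokyo.increment(3)      # 3 views recorded in Tokyo
virginia.increment(2)   # 2 views recorded in Virginia
tokyo.merge(virginia)   # periodic gossip/sync
print(tokyo.value)      # -> 5 (the converged total)
```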
Two users try to register the same username simultaneously
What consistency guarantee do you need?
Answer: Linearizability. The username uniqueness check must be globally ordered — if User A's registration completes before User B's starts, User B must see that the username is taken. With eventual consistency, both users could check simultaneously, both see the username as available, and both register — violating the uniqueness constraint. Use a linearizable store (etcd, Zookeeper) or a single-leader database with a UNIQUE constraint for this specific operation.
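As a sketch of the single-leader variant, here is the UNIQUE-constraint approach using Python's built-in sqlite3 as a stand-in for any single-leader SQL database (the table and usernames are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY)")

def register(username: str) -> bool:
    try:
        # The database serializes inserts, so the uniqueness check and the
        # insert are atomic: there is no window for a double registration.
        conn.execute("INSERT INTO users (username) VALUES (?)", (username,))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False   # someone else claimed the name first

print(register("alice"))   # -> True  (User A wins)
print(register("alice"))   # -> False (User B sees the name is taken)
```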
Your globally distributed database has replicas in US, EU, and Asia
A user in Asia writes data. How quickly should a user in the US see it?
Answer: It depends on the consistency model. Strong consistency: the US user sees it immediately (but the write takes ~200ms because it must propagate to a quorum across continents). Eventual consistency: the write completes in ~5ms (local replica), but the US user might see stale data for 100-500ms until replication catches up. For most applications, eventual consistency with read-your-own-writes is the sweet spot — the writing user sees their update immediately, other users see it within a second.
Common Mistakes
These misconceptions trip up engineers in interviews and lead to bad architectural decisions.
Misunderstanding CAP theorem
Saying 'our system is CA' or 'we chose all three.' CAP says you can't have all three DURING A PARTITION. Many people also think CAP means you permanently give up one property — it's actually about behavior during the (rare) partition event.
✅ Frame it correctly: 'During a partition, we choose consistency over availability for financial data (CP), and availability over consistency for the news feed (AP). When there's no partition, we have both.' Also mention PACELC for the normal-operation trade-off.
Assuming strong consistency is always better
Defaulting to strong consistency for everything because 'correctness is important.' Strong consistency has real costs: higher latency (50-200ms vs 5ms), lower throughput (coordination overhead), and lower availability during partitions. For most data, these costs aren't justified.
✅ Apply strong consistency surgically — only where the cost of a stale read is unacceptable (money, inventory, bookings). Use eventual consistency for everything else (feeds, likes, analytics, logs). Most data in most systems is fine with eventual consistency.
Ignoring latency trade-offs
Focusing only on the CAP partition scenario and ignoring that the latency vs consistency trade-off (PACELC's 'E') affects EVERY request, not just during rare partitions. A system that's 10x slower for perfect consistency might be worse than one that's fast with occasional stale reads.
✅ Use PACELC thinking. Ask: 'During normal operation, do we want 5ms reads (eventual) or 50ms reads (strong)?' For a feed that loads 100 times per second, that 45ms difference is the difference between a snappy app and a sluggish one.
Confusing linearizability with strong consistency
Using 'strong consistency' and 'linearizability' interchangeably. Linearizability is a specific, formal guarantee: operations appear in a single global order that respects real-time. Strong consistency is a broader term that can mean different things in different contexts (sequential consistency, causal consistency, etc.).
✅ Be precise: linearizability = strongest guarantee (real-time ordering). Sequential consistency = operations appear in some total order (but not necessarily real-time). Causal consistency = causally related operations are ordered. In interviews, say 'linearizable' when you mean the strongest guarantee, not just 'strongly consistent.'
Applying one consistency model to the entire system
Saying 'our system uses eventual consistency' or 'our system is CP.' Real systems use different consistency models for different data. A social media platform uses strong consistency for passwords, eventual consistency for feeds, and linearizability for username uniqueness.
✅ Design consistency per data type, not per system. Map each piece of data to its consistency requirement based on the cost of a stale read. This is how every large-scale system actually works.