
Consistency & Consensus

Deep dive into distributed systems theory — CAP theorem, PACELC model, strong vs eventual consistency, and linearizability. Understand how systems behave under failure.

28 min read · 10 sections
01

The Big Picture — Why Distributed Systems Are Hard

A single database on a single machine is simple. Every read sees the latest write. There's one copy of the data, one source of truth. But a single machine can fail, can't handle millions of requests, and is in one geographic location. So we distribute — we replicate data across multiple machines in multiple locations. And that's where everything gets hard.

🏦

The Bank Branches Analogy

Imagine a bank with 3 branches in different cities. Each branch has a copy of your account balance. You deposit $500 at the Mumbai branch. At the exact same moment, your spouse withdraws $300 at the Delhi branch. Both branches have the old balance ($1,000). Mumbai updates to $1,500. Delhi updates to $700. Which one is correct? Neither — the correct balance is $1,200. This is the fundamental problem of distributed systems: when multiple copies of data exist, and updates happen at different places at the same time, how do you keep everything consistent?

The core tension: we replicate data for reliability and performance, but replication introduces the possibility of conflicting updates, stale reads, and disagreement between nodes. Consistency and consensus are the tools we use to manage this tension.
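The lost-update conflict from the bank analogy can be sketched in a few lines of Python. This is a toy illustration (not any real database's behavior): two replicas each apply an update against their own stale snapshot, and neither ends up with the correct serialized balance.

```python
# Toy sketch of the lost-update problem: two branches update
# independent copies of the same balance.

balance = 1000          # the shared starting balance
mumbai_copy = balance   # Mumbai reads $1,000
delhi_copy = balance    # Delhi reads $1,000 at the same moment

mumbai_copy += 500      # deposit: Mumbai now believes $1,500
delhi_copy -= 300       # withdrawal: Delhi now believes $700

# The correct result applies BOTH updates, in some order:
serialized = balance + 500 - 300   # $1,200

# Neither replica holds the correct balance.
assert mumbai_copy == 1500 and delhi_copy == 700 and serialized == 1200
```

Consensus protocols exist precisely to force updates like these into a single agreed order before they are applied.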

🔥 Key Insight

Distributed systems are hard not because the algorithms are complex, but because the physical world is unreliable. Networks drop packets. Machines crash. Clocks drift. Any solution must work correctly despite these failures — and that's what makes it fundamentally different from single-machine programming.

02

Consistency & Consensus Overview

🔄

Consistency

All nodes see the same data at the same time. When you write a value, every subsequent read (from any node) returns that value. The question: how strict is 'same time'?

🤝

Consensus

All nodes agree on a single value or decision. Even if some nodes fail or messages are delayed, the remaining nodes reach agreement. The mechanism that enables consistency.

Why We Replicate Data

🛡️

Fault Tolerance

If one machine dies, the data still exists on other machines. No single point of failure. A 3-replica system survives 2 machine failures.

Read Performance

Multiple replicas can serve reads in parallel. A user in Tokyo reads from a Tokyo replica instead of a Virginia server. Lower latency, higher throughput.

🌍

Geographic Distribution

Data close to users worldwide. A global service needs replicas in multiple regions so users don't wait for cross-continent round trips.

Problems Introduced by Replication

⚔️ Conflicts

Two users update the same data on different replicas simultaneously. Which update wins? How do you merge them?

⏳ Delays

Replication takes time. A write to Node A hasn't reached Node B yet. A read from Node B returns stale data.

💥 Failures

A network partition splits nodes into groups that can't communicate. Each group might accept conflicting writes.

💡 The Fundamental Trade-off

You can't have it all. Stronger consistency requires more coordination between nodes, which means higher latency and lower availability. Weaker consistency allows faster responses and higher availability, but clients may see stale or conflicting data. Every distributed system sits somewhere on this spectrum.
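The coordination cost behind this trade-off can be made concrete with a write-path sketch. The latencies below are hypothetical numbers for illustration: an async write acknowledges after the first replica, a quorum write waits for a majority.

```python
# Hypothetical per-replica acknowledgment times (ms) for one write.
replica_ack_ms = [3, 45, 60]

def async_write():
    # Weaker consistency: acknowledge after the first replica confirms.
    return min(replica_ack_ms)

def quorum_write():
    # Stronger consistency: wait for a majority (2 of 3) to confirm,
    # so latency is set by the 2nd-fastest replica.
    need = len(replica_ack_ms) // 2 + 1
    return sorted(replica_ack_ms)[need - 1]

print(async_write())    # 3 ms: fast, but replicas may lag behind
print(quorum_write())   # 45 ms: slower, but a majority has the write
```

The same arithmetic applies to reads, which is why "more coordination" translates directly into "higher latency" on every request.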

03

CAP Theorem — Deep Dive

The CAP theorem states that a distributed system can provide at most two of three guarantees simultaneously: Consistency, Availability, and Partition Tolerance. Since network partitions are inevitable in any distributed system, the real choice is between Consistency and Availability during a partition.

🔒

Consistency (C)

Every read receives the most recent write or an error. All nodes see the same data at the same time. If you write 'balance = $500', every node immediately returns $500.

🟢

Availability (A)

Every request receives a response (not an error), even if it might not contain the most recent write. The system always responds, even if the data might be stale.

🔌

Partition Tolerance (P)

The system continues to operate despite network partitions — when some nodes can't communicate with others. Messages between nodes are lost or delayed.

🏦

Bank Branches During a Storm

Your bank has branches in Mumbai and Delhi. A storm cuts the network between them (partition). A customer in Mumbai deposits $500. Now the bank has a choice:

CP (Consistency + Partition Tolerance): The Delhi branch refuses to serve any requests until it can sync with Mumbai. 'Sorry, system is down.' Customers are frustrated, but no one sees wrong data.

AP (Availability + Partition Tolerance): The Delhi branch keeps serving requests with the last known balance. Customers get responses, but the balance might be stale. When the network recovers, the branches reconcile.

You can't have CA during a partition — that would mean both branches show the correct balance AND respond to every request, which is impossible when they can't communicate.

Why Partition Tolerance Is Non-Negotiable

In any real distributed system, network partitions will happen. Cables get cut. Routers fail. Cloud availability zones lose connectivity. You can't choose "no partitions" — you can only choose how to behave when they occur. This means the real choice is always CP or AP.

System Type | During Partition | Trade-off | Examples
CP (Consistent + Partition Tolerant) | Rejects requests until partition heals | Sacrifices availability for correctness | HBase, MongoDB (strong read), ZooKeeper, etcd
AP (Available + Partition Tolerant) | Serves requests with potentially stale data | Sacrifices consistency for availability | Cassandra, DynamoDB, CouchDB, DNS

CAP in Action — What Happens During a Partition
Network partition splits the cluster:

  [Node A, Node B]  ←— can't talk to —→  [Node C]

Client writes "balance = $500" to Node A.
Node A and B are updated. Node C still has the old value ($1000).

CP System (e.g., HBase):
Node C stops accepting reads/writes
Returns error: "Service unavailable"
When partition heals, Node C syncs and becomes available
Guarantee: no client ever sees stale data

AP System (e.g., Cassandra):
Node C keeps accepting reads/writes
Returns old value: "balance = $1000" (stale!)
When partition heals, conflict resolution merges the data
Guarantee: every client always gets a response
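The partitioned Node C's behavior above can be sketched as a toy policy switch. All names here are hypothetical; real systems implement this via quorums and leases, not a boolean flag.

```python
# Toy sketch of one replica's read path under CP vs AP policies.

class Replica:
    def __init__(self, policy, value):
        self.policy = policy        # "CP" or "AP"
        self.value = value          # local copy, possibly stale
        self.partitioned = False    # cut off from the rest of the cluster?

    def read(self):
        if self.partitioned and self.policy == "CP":
            # CP: refuse to answer rather than risk serving stale data.
            raise RuntimeError("Service unavailable")
        # AP: answer with whatever we have, stale or not.
        return self.value

node_c = Replica("AP", 1000)   # Node C still holds the old balance
node_c.partitioned = True
print(node_c.read())           # 1000: stale, but the client got a response

node_c.policy = "CP"
# node_c.read() would now raise "Service unavailable" instead.
```

The same node, the same stale data: only the policy decides whether the client sees an error or an old value.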

🎯 Interview Insight

Don't say "I'd choose CA." In a distributed system, partitions are inevitable, so CA is not a real option. The correct framing: "During a partition, I'd choose CP for financial data (correctness matters more) and AP for social feeds (availability matters more)." Different parts of the same system can make different choices.

04

PACELC Model

CAP only describes behavior during partitions. But what about normal operation — when there's no partition? That's where PACELC comes in. It extends CAP to cover the full picture.

PACELC — The Full Framework
If there is a Partition (P):
Choose between Availability (A) and Consistency (C)

Else (E) — normal operation, no partition:
Choose between Latency (L) and Consistency (C)

Full notation: PA/EL, PC/EC, PA/EC, PC/EL

Examples:
  Cassandra  → PA/EL  (Available during partition, low latency normally)
  DynamoDB   → PA/EL  (Same: availability and speed first)
  HBase      → PC/EC  (Consistent during partition, consistent normally)
  MongoDB    → PC/EC  (Strong consistency by default)
  Cosmos DB  → Configurable (you choose per query!)

Why PACELC Is More Realistic Than CAP

CAP only matters during partitions — which are rare (minutes per year). PACELC captures the trade-off you face every single request: do you wait for all replicas to confirm (consistency, higher latency) or respond from the nearest replica immediately (low latency, possibly stale)?

⚡ EL — Else Latency (Speed First)

  • Read from the nearest replica immediately
  • Don't wait for all replicas to agree
  • Response in 1-5ms (local replica)
  • Risk: data might be slightly stale
  • Use case: social feeds, product catalogs, caching

🔒 EC — Else Consistency (Correctness First)

  • Wait for majority of replicas to confirm
  • Every read sees the latest write
  • Response in 10-100ms (coordination overhead)
  • Guarantee: no stale reads
  • Use case: banking, inventory, bookings
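The EL vs EC read paths above can be sketched with hypothetical replica latencies: EL answers from the nearest replica, EC waits for a majority, so its latency is set by the second-fastest replica in a 3-node cluster.

```python
# Hypothetical round-trip times (ms) from a client in Tokyo
# to three replicas.
replica_latencies_ms = {"tokyo": 2, "virginia": 80, "frankfurt": 95}

def el_read():
    # EL: read the nearest replica immediately; may be stale.
    return min(replica_latencies_ms.values())

def ec_read():
    # EC: wait for a majority (2 of 3) of replicas to respond,
    # so latency is the 2nd-fastest replica's response time.
    majority = len(replica_latencies_ms) // 2 + 1
    return sorted(replica_latencies_ms.values())[majority - 1]

print(el_read())   # 2 ms: fast, possibly stale
print(ec_read())   # 80 ms: coordinated, guaranteed fresh
```

Note that EC latency is dominated by geography: the majority must include at least one remote replica, so cross-continent round trips are unavoidable.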
Database | During Partition (P) | Normal Operation (E) | PACELC
Cassandra | Available (A) | Low Latency (L) | PA/EL
DynamoDB | Available (A) | Low Latency (L) | PA/EL
MongoDB | Consistent (C) | Consistent (C) | PC/EC
HBase | Consistent (C) | Consistent (C) | PC/EC
Cosmos DB | Configurable | Configurable | Tunable
PostgreSQL (single) | N/A (not distributed) | Consistent (C) | EC

🎯 Interview Insight

When discussing database choices, use PACELC instead of just CAP. Say: "Cassandra is PA/EL — it prioritizes availability during partitions and low latency during normal operation. That makes it ideal for our social feed where stale data is acceptable but downtime is not." This shows deeper understanding than just saying "Cassandra is AP."

05

Strong vs Eventual Consistency

This is the most practical trade-off you'll face in system design. Strong consistency means every read returns the latest write. Eventual consistency means reads might return stale data, but all replicas will converge over time.

📝

The Shared Document Analogy

Strong consistency is like Google Docs with real-time sync — every collaborator sees every keystroke instantly. If Alice types 'Hello', Bob sees 'Hello' immediately. This requires constant communication between all participants. Eventual consistency is like emailing a document back and forth. Alice edits her copy, Bob edits his copy. Eventually they merge changes. For a while, they have different versions. But they'll converge. Faster to work independently, but you might have conflicts.

🔒 Strong Consistency

  • Every read returns the most recent write
  • All nodes agree before acknowledging a write
  • Higher latency (coordination overhead)
  • Lower availability during partitions
  • Simpler application logic (no stale data handling)

⏳ Eventual Consistency

  • Reads might return stale data temporarily
  • Write acknowledged after one replica confirms
  • Lower latency (no coordination needed)
  • Higher availability (always responds)
  • Complex application logic (handle stale reads, conflicts)
Strong vs Eventual — What Users Experience
Scenario: User updates their profile name from "Alice" to "Alicia"

Strong Consistency:
  1. Write "Alicia" → wait for all replicas to confirm
  2. Any subsequent read from ANY server → "Alicia"
  3. User refreshes page → sees "Alicia" immediately
  4. Latency: 50-200ms (coordination across replicas)

Eventual Consistency:
  1. Write "Alicia" → acknowledged after 1 replica confirms
  2. Read from same server → "Alicia"
  3. Read from different server → might still show "Alice"
  4. After 100ms-2s, all replicas converge → "Alicia" everywhere
  5. Latency: 5-20ms (no coordination)

User experience impact:
User updates name, refreshes, sees old name briefly
"Did my update not save?" → confusion
Fix: read-your-own-writes consistency (read from the replica you wrote to)

When Is Eventual Consistency Acceptable?

Data Type | Consistency Needed | Why
Bank balance | Strong | Showing wrong balance → overdraft, fraud, legal issues
Inventory count | Strong | Stale count → overselling, angry customers
Social media likes | Eventual | Showing 1,023 instead of 1,024 likes → nobody notices
News feed | Eventual | Seeing a post 2 seconds late → acceptable
Shopping cart | Session | User must see their own updates; other users don't matter
Leaderboard | Eventual | Scores updating with a few seconds delay → fine
Flight booking | Strong | Double-booking a seat → catastrophic

⚠️ Stale Reads — The Hidden UX Problem

The most common user complaint with eventual consistency: "I just updated my profile but it still shows the old name." Fix: use "read-your-own-writes" consistency — route the user's reads to the same replica they wrote to. Other users can see eventual consistency; the writing user sees strong consistency for their own data.
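Read-your-own-writes routing can be sketched as follows. This is a minimal toy design (the two-replica store and its pinning map are hypothetical): the store remembers which replica accepted each user's write and pins that user's reads to it, while other users may hit a lagging replica.

```python
# Toy sketch of read-your-own-writes over two async replicas.

class Store:
    def __init__(self):
        self.replicas = {"a": {}, "b": {}}   # async replication between them
        self.last_write_replica = {}         # user_id -> replica that took the write

    def write(self, user_id, key, value):
        self.replicas["a"][key] = value          # only replica "a" has it so far
        self.last_write_replica[user_id] = "a"   # remember where this user wrote

    def read(self, user_id, key):
        # The writer is pinned to their own replica; everyone else may hit "b".
        replica = self.last_write_replica.get(user_id, "b")
        return self.replicas[replica].get(key)

s = Store()
s.write("alice", "name", "Alicia")
print(s.read("alice", "name"))   # Alicia: she sees her own write
print(s.read("bob", "name"))     # None: replica "b" hasn't caught up yet
```

Real systems implement the same idea with session tokens or sticky routing rather than an in-memory map, but the guarantee is identical: the writer never observes their own write missing.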

06

Linearizability

Linearizability is the strongest form of consistency. It guarantees that all operations appear to execute in a single, global order — as if there's only one copy of the data, even though it's replicated across multiple nodes.

🎫

The Billing Counter Queue

Imagine a single billing counter at a store. Customers are served in the exact order they arrive. If Alice arrives before Bob, Alice is served first — always. There's no ambiguity about the order. Linearizability is this queue applied to a distributed system. Even though there are multiple replicas (multiple counters), operations appear as if they went through a single counter in a strict order. If operation A completes before operation B starts, then every observer sees A's effect before B's.

Linearizability vs Eventual Consistency

The Difference — Visualized
Timeline:
  T1: Client A writes x = 1
  T2: Client A gets acknowledgment (write complete)
  T3: Client B reads x

Linearizable:
  Client B reads x → MUST return 1
  Because the write completed (T2) before the read started (T3),
  the read must see the write's effect. No exceptions.

Eventually Consistent:
  Client B reads x → might return 0 (old value)
  The write happened, but the replica Client B is reading from
  hasn't received the update yet. It will... eventually.

Linearizability = real-time ordering guarantee
Eventual consistency = no ordering guarantee (just convergence)

When linearizability is required

  • Leader election — all nodes must agree on who the leader is
  • Distributed locks — only one process holds the lock at a time
  • Unique constraints — no two users can register the same username
  • Financial transactions — account balance must be globally consistent
  • Coordination services — ZooKeeper, etcd rely on linearizability
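The unique-username case above illustrates why these operations need a single global order. Below is a toy sketch: a `threading.Lock` stands in for a consensus-backed store, making the check and the write one atomic step so that two concurrent registrations cannot both succeed. (In a real deployment this atomicity would come from a linearizable compare-and-set, not an in-process lock.)

```python
# Toy sketch: atomic check-and-claim for unique usernames.
import threading

usernames = {}
lock = threading.Lock()   # stand-in for a consensus-backed linearizable store

def register(user_id, name):
    with lock:                  # the check and the write form one atomic step
        if name in usernames:
            return False        # name already taken
        usernames[name] = user_id
        return True

print(register("u1", "alice"))   # True: first claim wins
print(register("u2", "alice"))   # False: second claimant sees it taken
```

Without the atomic step, both clients could pass the "is it free?" check before either write lands, and both registrations would succeed.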

Why linearizability is expensive

  • Requires coordination on every operation (consensus protocol)
  • Latency increases with geographic distance between replicas
  • Throughput is limited by the slowest replica in the quorum
  • Availability drops during partitions (CP trade-off)
  • Most applications don't actually need it for most data

🎯 Interview Insight

Don't say "we need linearizability everywhere." It's expensive and rarely needed for all data. Say: "We need linearizability for the lock service and unique username checks, but eventual consistency is fine for the news feed and like counts." This shows you understand the cost and apply it surgically.

07

End-to-End Scenario

Let's design the consistency model for two real systems — a banking platform and a social media platform — and see how the same concepts lead to completely different decisions.

🏦 Banking Platform

CAP choice: CP — during a partition, reject transactions rather than risk incorrect balances. A user seeing "service unavailable" is better than seeing a wrong balance and overdrawing.

PACELC: PC/EC — even during normal operation, we choose consistency over latency. A transfer taking 200ms instead of 20ms is acceptable. A wrong balance is not.

Consistency model: Strong consistency (linearizable for balance updates). Every read must return the latest balance. Use a consensus protocol (Raft/Paxos) for the transaction log.

User experience: Slightly higher latency (50-200ms per transaction). Occasional "service unavailable" during rare partitions. But the balance is always correct. Users trust the system.

Database: PostgreSQL with synchronous replication, or CockroachDB (distributed SQL with strong consistency).

📱 Social Media Platform

CAP choice: AP — during a partition, keep serving feeds and accepting posts. A user seeing a slightly stale feed is better than seeing "service unavailable." Social media must always be available.

PACELC: PA/EL — even during normal operation, we choose latency over consistency. The feed loading in 20ms with slightly stale data is better than 200ms with perfectly fresh data.

Consistency model: Eventual consistency for feeds, likes, and comments. Read-your-own-writes for the posting user (they should see their own post immediately). Strong consistency only for account operations (login, password change).

User experience: Ultra-fast feed loading. A new post might take 1-2 seconds to appear for other users. Like counts might be briefly inaccurate. Nobody notices or cares.

Database: Cassandra for feeds and activity data (PA/EL). PostgreSQL for user accounts (PC/EC). Redis for real-time counters (eventual, fast).

🔥 The Key Takeaway

The same system often uses different consistency models for different data. A social media platform uses strong consistency for passwords and eventual consistency for likes. A banking platform uses strong consistency for balances but might use eventual consistency for transaction history display. Match the consistency model to the data's requirements, not to the system as a whole.

08

Trade-offs & Decision Making

Consistency vs Availability

Dimension | Choose Consistency | Choose Availability
During partition | Reject requests (errors) | Serve stale data
User experience | Occasional downtime | Always responsive
Data correctness | Always correct | Temporarily stale
Use when | Wrong data causes real harm | Downtime causes real harm
Examples | Banking, inventory, bookings | Social media, CDN, DNS

Latency vs Correctness

Dimension | Choose Low Latency | Choose Correctness
Read path | Read from nearest replica (5ms) | Read from leader or quorum (50-200ms)
Write path | Acknowledge after 1 replica (5ms) | Acknowledge after majority (50-200ms)
Stale reads | Possible (ms to seconds) | Never
Throughput | Higher (no coordination) | Lower (coordination overhead)
Use when | Speed > accuracy (feeds, caching) | Accuracy > speed (finance, inventory)

Complexity vs Guarantees

Dimension | Simple (Eventual) | Complex (Strong/Linearizable)
Implementation | Write to one replica, async replication | Consensus protocol (Raft, Paxos)
Failure handling | Conflict resolution after the fact | Prevent conflicts before they happen
Application code | Must handle stale reads, conflicts | Simpler (data is always correct)
Operational cost | Lower (fewer coordination failures) | Higher (consensus can stall)
Debugging | Harder (non-deterministic behavior) | Easier (deterministic ordering)

🎯 Decision Framework

For each piece of data, ask: "What's the cost of a stale read?" If the answer is "user sees 1,023 likes instead of 1,024" → eventual consistency. If the answer is "user sees $1,000 instead of $500 and withdraws money they don't have" → strong consistency. The cost of inconsistency determines the consistency model.

09

Interview Questions

Conceptual, scenario-based, and trick questions you're likely to encounter.

Q: Can you achieve CA (Consistency + Availability) in a distributed system?

A: Not during a network partition. The CAP theorem proves that when nodes can't communicate (partition), you must choose: respond with potentially stale data (A) or refuse to respond until the partition heals (C). However, when there's no partition (which is most of the time), you CAN have both consistency and availability — that's the 'E' in PACELC. A single-node database (non-distributed) is effectively CA, but it's not partition-tolerant because it's a single point of failure.

Q: What is eventual consistency?

A: A consistency model where, after a write, not all replicas are updated immediately. For a brief period, different replicas may return different values. If no new writes occur, all replicas will eventually converge to the same value. 'Eventually' is typically milliseconds to seconds, not hours. The key insight: eventual consistency is not 'no consistency' — it's a guarantee that the system WILL converge. It's a deliberate trade-off for lower latency and higher availability.

Q: When would you choose AP over CP?

A: When availability is more valuable than perfect consistency. Examples: (1) Social media feeds — users would rather see a slightly stale feed than a 'service unavailable' page. (2) DNS — returning a cached (possibly stale) IP is better than failing to resolve. (3) Shopping cart — losing a cart update is bad, but the entire site being down is worse. (4) CDN — serving a cached page is better than no page. The common thread: the cost of brief inconsistency is lower than the cost of downtime.

1

You're designing a distributed counter for video view counts

What consistency model would you use?

Answer: Eventual consistency. View counts don't need to be perfectly accurate in real-time. If the count shows 1,000,042 instead of 1,000,045 for a few seconds, nobody notices. Use a PA/EL system (Cassandra or Redis) where each node increments its local counter and periodically syncs. This gives you ultra-low latency writes (critical for high-traffic videos) and high availability. The count converges within seconds.
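The "each node increments locally and periodically syncs" design is essentially a grow-only counter (G-counter) CRDT, and it can be sketched in a few lines. The class below is a toy illustration: each node tracks per-node counts and merging takes the element-wise max, so replicas converge regardless of sync order.

```python
# Toy G-counter sketch: an eventually consistent view counter.

class ViewCounter:
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.counts = {n: 0 for n in nodes}   # one slot per node

    def increment(self):
        # Local write only: no coordination, no network round trip.
        self.counts[self.node_id] += 1

    def merge(self, other):
        # Anti-entropy sync: take the max per node slot.
        for n in self.counts:
            self.counts[n] = max(self.counts[n], other.counts[n])

    def total(self):
        return sum(self.counts.values())

a = ViewCounter("a", ["a", "b"])
b = ViewCounter("b", ["a", "b"])
a.increment(); a.increment(); b.increment()   # writes land on different nodes
a.merge(b); b.merge(a)                        # periodic background sync
print(a.total(), b.total())                   # both converge to 3
```

Because merge is commutative, associative, and idempotent, nodes can sync in any order (or repeatedly) and still agree on the same total.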

2

Two users try to register the same username simultaneously

What consistency guarantee do you need?

Answer: Linearizability. The username uniqueness check must be globally ordered — if User A's registration completes before User B's starts, User B must see that the username is taken. With eventual consistency, both users could check simultaneously, both see the username as available, and both register — violating the uniqueness constraint. Use a linearizable store (etcd, ZooKeeper) or a single-leader database with a UNIQUE constraint for this specific operation.

3

Your globally distributed database has replicas in US, EU, and Asia

A user in Asia writes data. How quickly should a user in the US see it?

Answer: It depends on the consistency model. Strong consistency: the US user sees it immediately (but the write takes ~200ms because it must propagate to a quorum across continents). Eventual consistency: the write completes in ~5ms (local replica), but the US user might see stale data for 100-500ms until replication catches up. For most applications, eventual consistency with read-your-own-writes is the sweet spot — the writing user sees their update immediately, other users see it within a second.

10

Common Mistakes

These misconceptions trip up engineers in interviews and lead to bad architectural decisions.

📐

Misunderstanding CAP theorem

Saying 'our system is CA' or 'we chose all three.' CAP says you can't have all three DURING A PARTITION. Many people also think CAP means you permanently give up one property — it's actually about behavior during the (rare) partition event.

Frame it correctly: 'During a partition, we choose consistency over availability for financial data (CP), and availability over consistency for the news feed (AP). When there's no partition, we have both.' Also mention PACELC for the normal-operation trade-off.

🔒

Assuming strong consistency is always better

Defaulting to strong consistency for everything because 'correctness is important.' Strong consistency has real costs: higher latency (50-200ms vs 5ms), lower throughput (coordination overhead), and lower availability during partitions. For most data, these costs aren't justified.

Apply strong consistency surgically — only where the cost of a stale read is unacceptable (money, inventory, bookings). Use eventual consistency for everything else (feeds, likes, analytics, logs). Most data in most systems is fine with eventual consistency.

⏱️

Ignoring latency trade-offs

Focusing only on the CAP partition scenario and ignoring that the latency vs consistency trade-off (PACELC's 'E') affects EVERY request, not just during rare partitions. A system that's 10x slower for perfect consistency might be worse than one that's fast with occasional stale reads.

Use PACELC thinking. Ask: 'During normal operation, do we want 5ms reads (eventual) or 50ms reads (strong)?' For a feed that loads 100 times per second, that 45ms difference is the difference between a snappy app and a sluggish one.

🔀

Confusing linearizability with strong consistency

Using 'strong consistency' and 'linearizability' interchangeably. Linearizability is a specific, formal guarantee: operations appear in a single global order that respects real-time. Strong consistency is a broader term that can mean different things in different contexts (sequential consistency, causal consistency, etc.).

Be precise: linearizability = strongest guarantee (real-time ordering). Sequential consistency = operations appear in some total order (but not necessarily real-time). Causal consistency = causally related operations are ordered. In interviews, say 'linearizable' when you mean the strongest guarantee, not just 'strongly consistent.'

🌐

Applying one consistency model to the entire system

Saying 'our system uses eventual consistency' or 'our system is CP.' Real systems use different consistency models for different data. A social media platform uses strong consistency for passwords, eventual consistency for feeds, and linearizability for username uniqueness.

Design consistency per data type, not per system. Map each piece of data to its consistency requirement based on the cost of a stale read. This is how every large-scale system actually works.