Understanding the CAP Theorem Before Your Next Interview
You are in a system design interview. The interviewer asks: "How would you design a distributed database?"
Your brain knows you need to talk about the CAP theorem. But what does it actually mean?
Let us break it down, because this single concept is the foundation of every distributed system in the real world.
Key sources: Eric Brewer's original 2000 PODC keynote, "Brewer's CAP Theorem" by Seth Gilbert and Nancy Lynch (MIT), and real-world engineering practices from Amazon, Google, and MongoDB.
The Theorem in One Sentence
You can have at most two out of three guarantees in a distributed system:
| Letter | Stands For | Meaning | |--------|-----------|---------| | C | Consistency | Every read sees the same data or gets an error | | A | Availability | Every request gets a response, even if data is stale | | P | Partition Tolerance | System keeps working even if network breaks |
The Three Guarantees
Consistency: Everyone Sees the Same Thing
Imagine you and your friend are looking at the same whiteboard. If you write "Meeting at 3 PM" on it, your friend should see "Meeting at 3 PM" immediately, not a different value for the next 5 seconds.
In distributed systems, Consistency means that when a write happens, every subsequent read from any node returns that write. If a node cannot guarantee this, it returns an error.
Real example: A bank ATM. When you withdraw money, the balance must update everywhere immediately. Bank systems prioritize C.
Availability: The System Is Always Up
No matter what, the system responds to your request. Even if some servers are on fire. Even if the network is flaky. You always get a reply.
Real example: YouTube. If you cannot see the exact latest comment count, that is fine. But if the page does not load at all, you leave. YouTube prioritizes A.
Partition Tolerance: Surviving Network Failure
A partition is when the network between servers gets cut. Some servers cannot talk to others.
Partition Tolerance means that even when the network breaks, the system keeps working. Since network failures are inevitable (cables get cut, routers fail, packets drop), every distributed system MUST be partition tolerant.
This is the key insight: you do not choose whether to have P. You have to have P. So your real choice is between C and A.
The Trade-off: CP vs AP
Since you must pick P (partition tolerance is not optional in a distributed system), your decision is really between CP and AP.
CP System (Consistency + Partition Tolerance)
When the network breaks, the system stops accepting writes on the disconnected side. It prefers accuracy over availability.
Behavior during partition: Some nodes become read-only or return errors until the network heals.
Examples: - Google Bigtable — They would rather return an error than serve stale data - Traditional banking systems — If one branch cannot sync, it stops transactions - ZooKeeper — Configuration coordination requires correctness above all
AP System (Availability + Partition Tolerance)
When the network breaks, both sides keep accepting requests, even if they might have stale data. They sync up later.
Behavior during partition: Every node keeps serving reads and writes. Data might be inconsistent temporarily (eventual consistency).
Examples: - Amazon DynamoDB — Adding to a cart should always work, even during network issues - DNS — Cached results might be a few minutes old, but the internet keeps working - Social media — Seeing slightly old like counts is acceptable
Real System Choices
| System | CAP Choice | Why | |--------|-----------|-----| | Amazon DynamoDB | AP | Shopping cart must work during failures | | Google Bigtable | CP | Search index must be consistent | | MongoDB | Configurable | Default is CP | | Cassandra | AP | High write throughput, eventual consistency | | Redis | CP | Must have accurate cache state | | PostgreSQL (single) | CA* | Single-node, no partition to worry about |
Single-node systems can be CA, but that does not apply to distributed databases.
Common Interview Question
Does this mean CA systems do not exist in distributed databases?
Correct. CA (Consistency + Availability without Partition Tolerance) only works on a single machine. As soon as you have multiple nodes, network partitions will happen, and you must choose between C and A.
This is why Brewer's observation is so influential: "Partition tolerance is not optional."
Key Takeaways
- You always have P — network partitions are inevitable.
- Your real choice is CP versus AP.
- CP is accurate but may be unavailable during failures.
- AP is always available but may serve stale data during failures.
- CA only works on single machines, not in distributed systems.
For your next interview: When asked "How would you design X?", start by naming your CAP trade-off. Say "This is a CP system because..." or "I would make this AP because..." — interviewers appreciate when you show you understand the fundamental trade-off.