Average Thinking vs Worst-Case Thinking in System Design
When designing a system, how do you estimate its capacity? Many engineers calculate based on average traffic. Experienced engineers plan for the worst case. The difference between these approaches determines whether a system survives traffic spikes or collapses.
Key sources: "Designing Data-Intensive Applications" by Martin Kleppmann, "Site Reliability Engineering" (Google), "The Art of Scalability" (Abbott, Fisher).
The Flaw in Average Thinking
Consider a social media platform averaging 1,000 requests per second. The engineer provisions for 1,200 RPS (20% headroom).
On a normal day, this works fine. But when a celebrity posts about the platform, traffic spikes to 10,000 RPS. The system collapses. Users see errors. The celebrity deletes their post. The opportunity is lost.
The problem: average thinking ignores traffic patterns. An average describes a typical moment and says nothing about how far above it the peaks reach.
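The arithmetic behind the example above can be sketched in a few lines of Python. The function names are hypothetical and the numbers are the ones from the scenario (1,000 RPS average, 20% headroom, 10,000 RPS spike). Treat this as a rough illustration, not a capacity model.

```python
def provisioned_capacity(average_rps: int, headroom_pct: int) -> int:
    """Capacity sized as the average load plus a percentage of headroom."""
    return average_rps * (100 + headroom_pct) // 100


def utilization(load_rps: int, capacity_rps: int) -> float:
    """Fraction of provisioned capacity consumed by a given load."""
    return load_rps / capacity_rps


average_rps = 1_000
spike_rps = 10_000

capacity_rps = provisioned_capacity(average_rps, headroom_pct=20)
normal_utilization = utilization(average_rps, capacity_rps)
spike_utilization = utilization(spike_rps, capacity_rps)

# Requests per second that arrive during the spike with no capacity to serve them.
unserved_rps = max(0, spike_rps - capacity_rps)

print(f"capacity:           {capacity_rps} RPS")
print(f"normal utilization: {normal_utilization:.0%}")
print(f"spike utilization:  {spike_utilization:.0%}")
print(f"unserved load:      {unserved_rps} RPS")
```

On a normal day the system runs at roughly 83% of capacity, which looks comfortable. During the spike the demand is more than eight times what was provisioned, so 20% headroom makes almost no difference.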
Understanding Traffic Patterns
Real traffic is not uniform. It follows patterns:
- Diurnal patterns: Traffic peaks during business hours, drops at night
- Event-driven spikes: Product launches, marketing campaigns, breaking news
- Seasonal peaks: Black Friday for e-commerce, tax day for accounting software
- Viral effects: A social media post driving traffic to a site not built for it
Each pattern requires different capacity planning.
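One way to quantify a pattern is the peak-to-average ratio of observed traffic. The sketch below uses a made-up 24-hour series of hourly request rates (the `hourly_rps` values are illustrative assumptions, chosen so the daily average is 1,000 RPS) and compares it against the 1,200 RPS capacity from the earlier example.

```python
def peak_to_average(samples: list[int]) -> float:
    """Ratio of the highest observed rate to the mean rate."""
    return max(samples) / (sum(samples) / len(samples))


def hours_over_capacity(samples: list[int], capacity_rps: int) -> int:
    """Number of hourly samples whose rate exceeds the provisioned capacity."""
    return sum(1 for rps in samples if rps > capacity_rps)


# Hypothetical diurnal traffic: one value per hour, midnight to 23:00.
hourly_rps = [
    300, 250, 200, 200, 250, 400,
    700, 1000, 1400, 1600, 1700, 1800,
    1750, 1700, 1650, 1600, 1500, 1400,
    1200, 1000, 800, 600, 500, 500,
]

average_rps = sum(hourly_rps) / len(hourly_rps)
ratio = peak_to_average(hourly_rps)
overloaded_hours = hours_over_capacity(hourly_rps, capacity_rps=1_200)

print(f"average: {average_rps:.0f} RPS, peak: {max(hourly_rps)} RPS")
print(f"peak-to-average ratio: {ratio:.2f}")
print(f"hours above 1,200 RPS: {overloaded_hours}")
```

With this made-up data, the ordinary daily peak alone already exceeds capacity sized from the average, before any event-driven or viral spike is considered.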
Percentile-Based Thinking
Instead of averages, use percentiles:
| Metric | What It Measures | Why It Matters |
|--------|-----------------|----------------|
| P50 (median) | Typical experience | "Average" user experience |
| P95 | 95th percentile | Most users have this or better |
| P99 | 99th percentile | The slowest 1% of requests |
| P999 | 99.9th percentile | Edge cases |
A service with a 100 ms P50 and a 2,000 ms P99 serves half of its requests in 100 ms or less, but the slowest 1% of requests take 2 seconds or longer. The P99 often reveals problems that the P50 hides.
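To make the contrast with averages concrete, here is a minimal sketch that computes percentiles with the nearest-rank method, which is one of several common percentile definitions. The latency sample is made up for illustration: 985 requests at 100 ms and 15 requests at 2,000 ms.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [100.0] * 985 + [2000.0] * 15

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean: {mean_ms} ms")
print(f"P50:  {p50_ms} ms")
print(f"P95:  {p95_ms} ms")
print(f"P99:  {p99_ms} ms")
```

In this sample the mean is 128.5 ms and both P50 and P95 are 100 ms, so all three suggest a healthy service. Only the P99 of 2,000 ms exposes the slow tail.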
Key Takeaways
- Average thinking underestimates the resources needed to handle traffic spikes.
- Understand your traffic patterns: diurnal, event-driven, seasonal, and viral.
- Plan for peak throughput, not average. Over-provision or auto-scale.
- Use percentiles (P50, P95, P99) instead of averages for latency measurements.
- Worst-case thinking builds resilience. Average thinking leads to outages.