Average Thinking vs Worst-Case Thinking in System Design

When designing a system, how do you estimate its capacity? Many engineers calculate based on average traffic. Experienced engineers plan for the worst case. The difference between these approaches determines whether a system survives traffic spikes or collapses.

Key sources: "Designing Data-Intensive Applications" by Martin Kleppmann, "Site Reliability Engineering" (Google), "The Art of Scalability" (Abbott, Fisher).


The Flaw in Average Thinking

Consider a social media platform averaging 1,000 requests per second. The engineer provisions for 1,200 RPS (20% headroom).

On a normal day, this works fine. But when a celebrity posts about the platform, traffic spikes to 10,000 RPS. The system collapses. Users see errors. The celebrity deletes their post. The opportunity is lost.

The problem: average thinking ignores traffic patterns.
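The gap in this example can be put into numbers. The following is a minimal sketch using the figures above; the function names and the "overload factor" framing are illustrative assumptions, not taken from the cited sources.

```python
def provisioned_capacity(average_rps: float, headroom: float) -> float:
    """Capacity when sizing from the average plus a fractional headroom."""
    return average_rps * (1 + headroom)


def overload_factor(peak_rps: float, capacity_rps: float) -> float:
    """How many times the offered load exceeds what the system can serve."""
    return peak_rps / capacity_rps


# Figures from the example above.
average_rps = 1_000
peak_rps = 10_000

capacity = provisioned_capacity(average_rps, headroom=0.20)
factor = overload_factor(peak_rps, capacity)
# Share of peak requests that exceed capacity (simplification: assumes the
# system keeps serving at full capacity instead of degrading under overload).
unserved = max(0.0, peak_rps - capacity) / peak_rps

print(f"capacity: {capacity:.0f} RPS")
print(f"overload factor: {factor:.1f}x")
print(f"share of peak requests beyond capacity: {unserved:.0%}")
```

With 20% headroom over the average, the spike arrives at more than eight times what the system was sized for. A real system under that load usually performs worse than this arithmetic suggests, because queues build up and retries add further traffic.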


Understanding Traffic Patterns

Real traffic is not uniform. It follows patterns:

  • Diurnal patterns: Traffic peaks during business hours, drops at night
  • Event-driven spikes: Product launches, marketing campaigns, breaking news
  • Seasonal peaks: Black Friday for e-commerce, tax day for accounting software
  • Viral effects: A social media post driving traffic to a site not built for it

Each pattern requires different capacity planning.
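One simple way to quantify a pattern is the peak-to-average ratio over a measurement window. The sketch below uses a synthetic, hand-written hourly series with a diurnal shape; the values and the name `hourly_rps` are hypothetical and stand in for data you would pull from your own monitoring.

```python
# Synthetic hourly request rates for one day (hypothetical values, hour 0 to 23).
hourly_rps = [
    300, 250, 200, 200, 250, 400, 700, 1100,
    1500, 1700, 1800, 1800, 1700, 1750, 1800, 1700,
    1500, 1200, 1000, 900, 800, 600, 450, 350,
]

average = sum(hourly_rps) / len(hourly_rps)
peak = max(hourly_rps)
peak_to_average = peak / average

print(f"average: {average:.0f} RPS")
print(f"peak: {peak} RPS")
print(f"peak-to-average ratio: {peak_to_average:.2f}")
```

A ratio like this only captures patterns that appear in the window you measured. Event-driven and viral spikes may not show up in historical data at all, so they need a separate estimate.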


Percentile-Based Thinking

Instead of averages, use percentiles:

| Metric | What It Measures | Why It Matters |
|--------|-----------------|----------------|
| P50 (median) | Typical experience | "Average" user experience |
| P95 | 95th percentile | Most users have this or better |
| P99 | 99th percentile | The slowest 1% of requests |
| P999 | 99.9th percentile | Edge cases |

A service with a 100 ms P50 and a 2,000 ms P99 serves half of its requests in 100 ms or less, but the slowest 1% take 2 seconds or longer. The P99 often reveals problems that the P50 hides.
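To illustrate how a mean can hide a slow tail, here is a minimal sketch that computes percentiles with the nearest-rank method on synthetic latency data. The `percentile` helper and the generated distribution (98% fast requests, 2% slow ones) are assumptions made for illustration; production systems typically rely on their metrics library for this.

```python
import math
import random


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


random.seed(42)  # fixed seed so the synthetic data is reproducible

# Synthetic data: 98% of requests near 100 ms, 2% near 2,000 ms.
latencies_ms = [random.gauss(100, 10) for _ in range(9_800)]
latencies_ms += [random.gauss(2_000, 100) for _ in range(200)]

mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean: {mean:.0f} ms")
print(f"P50:  {p50:.0f} ms")
print(f"P95:  {p95:.0f} ms")
print(f"P99:  {p99:.0f} ms")
```

In this synthetic data set the mean sits only modestly above the median, while the P99 lands in the slow group. Note that the P95 also stays in the fast group here, which is why looking at several percentiles together is more informative than any single number.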


Key Takeaways

  1. Average thinking underestimates the resources needed to handle traffic spikes.
  2. Understand your traffic patterns: diurnal, event-driven, seasonal, and viral.
  3. Plan for peak throughput, not average. Over-provision or auto-scale.
  4. Use percentiles (P50, P95, P99) instead of averages for latency measurements.
  5. Worst-case thinking builds resilience. Average thinking leads to outages.