15 May 2026 1 min read system-design

Average Thinking vs Worst-Case Thinking in System Design

When designing a system, how do you estimate its capacity? Many engineers calculate based on average traffic. Experienced engineers plan for the worst case. The difference between these approaches determines whether a system survives traffic spikes or collapses.

Key sources: "Designing Data-Intensive Applications" by Martin Kleppmann, "Site Reliability Engineering" (Google), "The Art of Scalability" (Abbott, Fisher).

The Flaw in Average Thinking

Consider a social media platform averaging 1,000 requests per second. The engineer provisions for 1,200 RPS (20% headroom).

On a normal day, this works fine. But when a celebrity posts about the platform, traffic spikes to 10,000 RPS. The system collapses. Users see errors. The celebrity deletes their post. The opportunity is lost.

The problem: an average hides how far traffic can depart from it. Here the spike is 10 times the average and more than 8 times the provisioned capacity.
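The gap in this example can be put in numbers. The following is a minimal Python sketch, not a capacity model: it assumes, purely for illustration, that every request beyond provisioned capacity simply goes unserved, whereas a real system might queue, shed, or degrade instead. The function name `headroom_report` and its parameters are hypothetical.

```python
def headroom_report(average_rps: float, headroom: float, peak_rps: float) -> dict:
    """Compare capacity provisioned from an average against an observed peak.

    Assumption (illustrative only): requests above provisioned capacity
    are not served at all.
    """
    provisioned = average_rps * (1 + headroom)
    unserved = max(0.0, peak_rps - provisioned)
    return {
        "provisioned_rps": provisioned,
        "peak_to_average": peak_rps / average_rps,
        "unserved_rps": unserved,
        "unserved_fraction": unserved / peak_rps,
    }


# Numbers from the example: 1,000 RPS average, 20% headroom, 10,000 RPS spike.
report = headroom_report(average_rps=1_000, headroom=0.20, peak_rps=10_000)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Under that simplifying assumption, roughly 88% of requests during the spike have no capacity behind them, which is why 20% headroom over the average offers little protection.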

Understanding Traffic Patterns

Real traffic is not uniform. It follows patterns:

Diurnal patterns: Traffic peaks during business hours, drops at night
Event-driven spikes: Product launches, marketing campaigns, breaking news
Seasonal peaks: Black Friday for e-commerce, tax day for accounting software
Viral effects: A social media post driving traffic to a site not built for it

Each pattern requires different capacity planning.
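One way to compare these patterns is the peak-to-average ratio of the request rate over a window. The sketch below uses made-up hourly figures (they are not measurements from any real system) to show how a purely diurnal day and a day with one event-driven spike differ; `peak_to_average`, `diurnal_day`, and `spike_day` are illustrative names.

```python
def peak_to_average(rps_samples):
    """Ratio of the busiest sample to the mean of all samples."""
    average = sum(rps_samples) / len(rps_samples)
    return max(rps_samples) / average


# Made-up hourly RPS for one day: 8 quiet night hours, 16 busy hours.
diurnal_day = [400] * 8 + [1_300] * 16

# The same day, with one busy hour replaced by a 10,000 RPS event spike.
spike_day = diurnal_day[:-1] + [10_000]

diurnal_ratio = peak_to_average(diurnal_day)
spike_ratio = peak_to_average(spike_day)

print(f"diurnal peak/average: {diurnal_ratio:.2f}")
print(f"spike   peak/average: {spike_ratio:.2f}")
```

In this invented data the diurnal day averages 1,000 RPS but peaks at 1,300 RPS every day, already above the 1,200 RPS provisioned in the earlier example. A predictable ratio like that suits scheduled or automatic scaling; the much larger ratio on the spike day calls for a different answer, such as reserved burst capacity or load shedding.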

Percentile-Based Thinking

Instead of averages, use percentiles:

| Metric | What It Measures | Why It Matters |
|--------|------------------|----------------|
| P50 (median) | Typical experience | "Average" user experience |
| P95 | 95th percentile | Most users have this or better |
| P99 | 99th percentile | The slowest 1% of requests |
| P999 | 99.9th percentile | Edge cases |

A service with a 100 ms P50 and a 2,000 ms P99 serves half of its requests in 100 ms or less, while the slowest 1% take 2 seconds or longer. The P99 often reveals problems that the P50 hides.
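To see how a mean can mask the tail, here is a small Python sketch using the nearest-rank definition of a percentile (one of several definitions in common use; monitoring tools may interpolate or estimate from histograms instead). The latency sample is invented to match the 100 ms / 2,000 ms figures above, and the `percentile` helper is illustrative.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented sample: 980 fast requests (100 ms) and 20 slow ones (2,000 ms).
latencies_ms = [100] * 980 + [2_000] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms  P50={p50} ms  P95={p95} ms  P99={p99} ms")
```

For this sample the mean is 138 ms and both P50 and P95 are 100 ms, so none of them hints at a problem. Only P99, at 2,000 ms, exposes the slow requests.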

Key Takeaways

Average thinking underestimates the resources needed to handle traffic spikes.
Understand your traffic patterns: diurnal, event-driven, seasonal, viral.
Plan for peak throughput, not average. Over-provision or auto-scale.
Use percentiles (P50, P95, P99) instead of averages for latency measurements.
Worst-case thinking builds resilience. Average thinking leads to outages.

