15 May 2026 4 min read system-design

Designing a Real-Time Chat System for Millions of Users

Real-time chat is one of the most challenging system design problems. Messages must be delivered instantly, in order, with low latency, across millions of concurrent connections. This article explores the architecture behind messaging platforms like WhatsApp, Telegram, and Discord.

Key sources: "Designing Data-Intensive Applications" by Martin Kleppmann, WhatsApp engineering blog, Discord engineering blog, Telegram MTProto documentation.

Requirements

A chat system at scale must handle:

Low-latency delivery: Messages delivered in under 100 milliseconds
Ordering: Messages within a conversation appear in the correct order
Delivery guarantees: At-least-once or exactly-once delivery
Multi-device support: Seamless sync across phone, desktop, and web
Offline messages: Store and forward messages when recipients are offline
Group chats: Broadcast messages to hundreds or thousands of members
Media sharing: Images, videos, documents, voice messages

Architecture Overview

A chat system has three main components:

Connection management: Maintain persistent connections with clients
Message routing: Deliver messages to the correct recipients
Message storage: Persist messages for history and offline delivery

Client A → WebSocket → Chat Server → Message Queue → Chat Server → WebSocket → Client B │ ↓ Database

Connection Management

HTTP is connectionless — each request opens a new connection. This is unsuitable for real-time communication where the server needs to push messages to clients.

WebSocket

WebSocket provides a persistent, bidirectional connection between client and server. The client opens a TCP connection, upgrades it to WebSocket via HTTP, and keeps the connection open.

text Client → HTTP Upgrade Request → Server Client ← WebSocket Connection ← Server Client → Message → Server Client ← Message ← Server

Each chat server handles tens of thousands of WebSocket connections. To scale beyond that, add more chat servers behind a load balancer.

Connection Affinity

When a client connects to a server, all messages for that client must go through that server. This is called connection affinity or sticky sessions.

Load Balancer → Server 1 → Clients A, B, C Load Balancer → Server 2 → Clients D, E, F

When Server 1 needs to send a message to Client D (connected to Server 2), it must forward the message to Server 2. This requires an internal routing layer.

Message Routing

Messages flow through the system in three stages:

1. Sender to Server

The sender's client sends the message over WebSocket to its connected chat server. The server:

Validates the message format
Checks permissions (can the sender message this recipient?)
Assigns a unique message ID (typically a timestamp + node ID + sequence number)
Stores the message in the database

2. Server-to-Server Routing

Each chat server maintains a registry of which clients are connected to it. When Server 1 receives a message for a recipient on Server 2, it must route the message internally.

This routing can use:

Redis pub/sub: Servers subscribe to channels named by recipient ID. A message published to a channel reaches all subscribed servers.
Direct TCP connections: Servers maintain a mesh of TCP connections for forwarding messages.
Message queue: Servers publish messages to a queue, and the consuming server delivers them.

3. Server to Recipient

The server that has the recipient's WebSocket connection receives the message and pushes it over the WebSocket.

If the recipient is offline, the message is stored and delivered when they reconnect.

Message Ordering

Messages in a conversation must appear in order. This is straightforward with a single server but requires coordination across servers.

Approach: Use a sequence number per conversation. The server maintaining the conversation state assigns sequence numbers atomically. Clients display messages in sequence number order, regardless of arrival order.

text Message "Hello" → Sequence 1 Message "How are you?" → Sequence 2 Message "Hi!" → Sequence 3 (may arrive before "How are you?") Client displays: "Hello" → "Hi!" → "How are you?" (reordered by sequence)

Offline Messages and Delivery Guarantees

Offline Storage

When a recipient is offline, the server stores messages in a database. When the recipient reconnects, the server:

Queries all undelivered messages for that user
Sends them in chronological order
Updates the delivery status

Delivery Guarantees

| Guarantee | Meaning | Implementation | |-----------|---------|---------------| | At-most-once | Message may be lost | Fire and forget. Fast but unreliable. | | At-least-once | Message delivered at least once | Ack/nack protocol. Recipient acknowledges receipt. Server retries on timeout. | | Exactly-once | Message delivered exactly once | Deduplication. Recipient tracks received message IDs and rejects duplicates. |

Most chat systems use at-least-once delivery. Exactly-once is complex and usually unnecessary for chat — a duplicate message is preferable to a lost one.

Group Chats

Group chats introduce the problem of fan-out: a single message must reach N recipients.

Approaches

Fan-out on write: The sender sends once. The server writes a copy of the message to each recipient's inbox. Easy to implement but requires N writes per message.

Fan-out on read: The sender sends once. The server stores the message once in a shared conversation log. Each recipient reads from the log. Efficient for storage but requires tracking each recipient's read position.

Many systems use a hybrid approach:

Small groups (up to 100): Fan-out on write — each recipient gets their own copy
Large groups (hundreds or thousands): Fan-out on read — recipients read from a shared log

WhatsApp uses this hybrid approach. Small group messages are written to each member's individual message store. Large broadcast messages are not fanned out at all — recipients pull messages when they open the app.

Scaling Challenges

| Challenge | Solution | |-----------|----------| | Millions of concurrent connections | Multiple chat servers behind a load balancer | | High message throughput | Partitioning by conversation ID or user ID | | Geographic latency | Regional chat servers with cross-region routing | | Message durability | Write-ahead log + database replication | | Media storage | Object storage (S3) with CDN delivery |

Production Example: WhatsApp

WhatsApp handles over 100 billion messages per day. Their architecture:

Custom Erlang-based server — each server handles ~1 million connections
Message persistence: Messages are stored for 30 days, then deleted
No message fan-out: Each message is stored once. Recipients pull messages.
End-to-end encryption: The Signal Protocol ensures messages are encrypted between sender and receiver. Servers cannot read message content.
Offline delivery: Messages are stored for 30 days. When a recipient comes online, all pending messages are delivered.

Key Takeaways

WebSocket provides persistent bidirectional connections necessary for real-time communication.
Connection affinity routes messages through the correct server for each client.
Message ordering is maintained through sequence numbers per conversation.
Offline messages are stored and delivered when the recipient reconnects.
Group chat requires fan-out strategies that depend on group size.
WhatsApp processes billions of messages daily using a pull-based model with end-to-end encryption.

Design principle: State is the bottleneck. Minimize server-side state and push as much logic as possible to the client.