Why Delivering a Message Exactly Once Is So Difficult
Message delivery guarantees are a fundamental concern in distributed systems. At-most-once, at-least-once, and exactly-once are the three standard guarantees. This article focuses on exactly-once delivery — why it is so desirable, why it is so difficult, and how production systems approximate it.
Key sources: "Designing Data-Intensive Applications" by Martin Kleppmann, Kafka documentation, "The Part-Time Parliament" by Leslie Lamport.
The Three Delivery Guarantees
| Guarantee | Behavior | Practical Meaning | |-----------|----------|-------------------| | At-most-once | Message delivered 0 or 1 times | May be lost, never duplicated | | At-least-once | Message delivered 1 or more times | Never lost, may be duplicated | | Exactly-once | Message delivered exactly 1 time | Never lost, never duplicated |
At-most-once is simple but unreliable. At-least-once is reliable but produces duplicates. Exactly-once is what everyone wants, but it requires solving fundamental distributed systems problems.
Why Exactly-Once Is Hard
The difficulty comes from two fundamental constraints:
Networks Are Unreliable
When a sender sends a message and waits for an acknowledgment, it cannot distinguish between:
- The message was lost — the receiver never got it
- The acknowledgment was lost — the receiver got it but the sender cannot tell
- The receiver failed — it got the message but crashed before acknowledging
- The receiver is slow — it will acknowledge eventually
In cases 2-4, the sender does not know whether to retransmit. Retransmission may cause a duplicate (case 2). Not retransmitting may lose the message (case 1).
Systems Fail Between Steps
Consider a payment system:
- Debit money from account A
- Send a message to credit account B
- Mark the transaction as complete
If the system crashes after step 2 but before step 3, the message was sent but its completion was not recorded. On restart, the system may resend the message, causing a duplicate credit to account B.
Approaches to Exactly-Once
1. Idempotent Receivers
An idempotent operation produces the same result regardless of how many times it is applied. If the receiver is idempotent, at-least-once delivery achieves the same effect as exactly-once.
Example — setting an account balance:
python
def set_balance(account_id, new_balance):
db.execute("UPDATE accounts SET balance = ? WHERE id = ?",
new_balance, account_id)
Example — incrementing a balance (NOT idempotent):
python
def increment_balance(account_id, amount):
db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
amount, account_id)
Making receivers idempotent is the most practical way to achieve exactly-once semantics. The sender retries until it gets a response, and the receiver handles duplicates safely.
2. Deduplication
The sender includes a unique message ID (UUID, sequence number, or transaction ID). The receiver tracks processed message IDs and ignores duplicates.
python
def receive_message(message):
if message.id in processed_ids:
return # Duplicate — ignore
process(message)
processed_ids.add(message.id)
This requires persistent storage of processed IDs. The storage must survive crashes. The deduplication window must be long enough to cover the maximum retry interval.
3. Two-Phase Commit (2PC)
2PC coordinates a transaction across multiple participants. A coordinator asks all participants to prepare. Once all confirm readiness, the coordinator tells them to commit.
```text Phase 1 (Prepare): Coordinator → Participant A: Can you commit? Coordinator → Participant B: Can you commit? A: Yes B: Yes
Phase 2 (Commit): Coordinator → Participant A: Commit Coordinator → Participant B: Commit A: Done B: Done ```
If any participant fails during phase 1, the transaction is aborted everywhere. If the coordinator crashes during phase 2, participants remain in a prepared state until the coordinator recovers.
2PC provides atomic commit across multiple systems. The cost is higher latency (multiple round trips) and the risk of blocking if the coordinator fails.
4. Kafka's Exactly-Once Semantics
Apache Kafka implements exactly-once semantics through a combination of:
- Idempotent producers: Each message has a sequence number. The broker deduplicates by producer ID and sequence number.
- Transactional writes: A producer can begin a transaction, send multiple messages, and commit atomically. Either all messages in the transaction are visible, or none are.
- Consumer transaction awareness: Consumers can read only committed messages and track their offsets transactionally.
python
producer.init_transactions()
producer.begin_transaction()
try:
producer.send("topic1", value1)
producer.send("topic2", value2)
producer.commit_transaction()
except:
producer.abort_transaction()
This provides exactly-once delivery within the Kafka ecosystem, from producer to consumer, as long as both are Kafka-aware.
The FLP Impossibility Result
The Fischer, Lynch, and Paterson (FLP) impossibility result states that in an asynchronous distributed system where processes can fail, it is impossible to guarantee agreement on a value in a single round of communication.
This means exactly-once delivery cannot be guaranteed in a purely asynchronous network without additional assumptions (failure detectors, timeouts, or synchronous phases). In practice, systems use timeouts and retries, which means they can only approximate exactly-once delivery.
Practical Recommendations
| Approach | Complexity | Data Integrity | Use Case | |----------|-----------|----------------|----------| | Idempotent receiver | Low | High (if designed correctly) | APIs, payment systems | | Deduplication | Medium | High | Message queues | | Two-phase commit | High | Very high | Distributed transactions | | Kafka EOS | High | High | Stream processing |
Key Takeaways
- Exactly-once delivery is theoretically impossible in asynchronous networks without additional assumptions.
- Idempotent receivers are the most practical approach — at-least-once delivery with safe retries.
- Deduplication using unique message IDs prevents duplicate processing.
- Two-phase commit provides atomicity across systems at the cost of latency and blocking risk.
- Kafka implements exactly-once semantics through idempotent producers and transactional writes.
- In practice, "exactly-once" usually means "at-least-once with idempotent processing."
Design principle: Design your receivers to be idempotent. This is simpler and more reliable than trying to prevent duplicates at the delivery layer.