Managing Transactions Across Multiple Services Without Chaos
In a monolithic application, a database transaction can span multiple operations atomically. In a microservice architecture, each service has its own database. A transaction that spans services cannot use a single database transaction. This article explores the patterns for managing multi-service transactions.
Key sources: "Building Microservices" by Sam Newman, "Designing Data-Intensive Applications" by Martin Kleppmann, and the paper "Sagas" by Hector Garcia-Molina and Kenneth Salem (1987).
The Problem
Consider an e-commerce order flow:
1. Order Service: Create an order (status: pending)
2. Payment Service: Charge the customer's credit card
3. Inventory Service: Reserve the items
4. Shipping Service: Schedule the shipment
5. Order Service: Update order status (status: confirmed)
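Each step involves a different service with its own database. If step 3 (reserving the items) fails, steps 1 and 2 (creating the order and charging the card) must be undone. But the Order Service and Payment Service have already committed their changes to their own databases, so there is no shared transaction left to roll back.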
This is the distributed transaction problem.
The Saga Pattern
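The Saga pattern manages distributed transactions by breaking them into a sequence of local transactions. Each local transaction updates data within a single service and commits on its own. If a step fails, the saga executes compensating transactions, in reverse order, to undo the steps that already completed. A compensating transaction is a new operation that semantically reverses an earlier one (a refund for a charge, for example); it is not a database rollback.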
Order Service: CreateOrder → OK
Payment Service: ProcessPayment → OK
Inventory Service: ReserveItems → FAIL
→ Compensate: RefundPayment (Payment Service)
→ Compensate: CancelOrder (Order Service)
Choreography-Based Saga
Each service publishes events after completing its local transaction. Other services listen for events that trigger their next step or compensating action.
Order Service → creates order → publishes "OrderCreated" event
Payment Service → listens for "OrderCreated" → processes payment → publishes "PaymentProcessed" event
Inventory Service → listens for "PaymentProcessed" → reserves items → publishes "ItemsReserved" event
Shipping Service → listens for "ItemsReserved" → schedules shipping → publishes "ShipmentScheduled" event
Order Service → listens for "ShipmentScheduled" → confirms order
On failure:
Inventory Service → reservation fails → publishes "ReservationFailed" event
Payment Service → listens for "ReservationFailed" → refunds payment → publishes "PaymentRefunded" event
Order Service → listens for "PaymentRefunded" → cancels order → publishes "OrderCancelled" event
Pros: No central coordinator to build or operate. Services are loosely coupled and react only to events.
Cons: The workflow logic is spread across services, so no single place describes the whole transaction. Difficult to monitor and debug.
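To make the event flow above concrete, here is a minimal sketch that uses a synchronous in-memory event bus as a stand-in for a real message broker. The `EventBus` class, the handler functions, and the `stock` dictionary are illustrative assumptions, not part of any particular framework; only the event names come from the flow above. Shipping is left out to keep the sketch short.

```python
from collections import defaultdict


class EventBus:
    """Synchronous in-memory stand-in for a message broker (hypothetical)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # names of all published events, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, order):
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(order)


def wire_services(bus, stock):
    """Register one handler per service; `stock` maps item -> units available."""

    def process_payment(order):  # Payment Service
        order["paid"] = True
        bus.publish("PaymentProcessed", order)

    def reserve_items(order):  # Inventory Service
        if stock.get(order["item"], 0) >= order["qty"]:
            stock[order["item"]] -= order["qty"]
            bus.publish("ItemsReserved", order)
        else:
            bus.publish("ReservationFailed", order)

    def refund_payment(order):  # Payment Service, compensating action
        order["paid"] = False
        bus.publish("PaymentRefunded", order)

    def cancel_order(order):  # Order Service, compensating action
        order["status"] = "cancelled"
        bus.publish("OrderCancelled", order)

    bus.subscribe("OrderCreated", process_payment)
    bus.subscribe("PaymentProcessed", reserve_items)
    bus.subscribe("ReservationFailed", refund_payment)
    bus.subscribe("PaymentRefunded", cancel_order)


def create_order(bus, item, qty):  # Order Service
    order = {"item": item, "qty": qty, "status": "pending", "paid": False}
    bus.publish("OrderCreated", order)
    return order


bus = EventBus()
wire_services(bus, stock={"widget": 1})
failed = create_order(bus, "widget", 5)  # asks for more than is in stock
print(bus.log)
```

Note that no function in the sketch describes the whole workflow. The sequence exists only in the subscriptions, which is the monitoring and debugging cost listed under the cons.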
Orchestration-Based Saga
A central orchestrator tells each service what to do. The orchestrator handles the transaction logic and invokes compensating actions on failure.
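```python
# The *_service objects are clients for the four services, defined elsewhere.
class OrderOrchestrator:
    def process_order(self, order):
        # Each step is paired with the action that compensates for it.
        steps = [
            (order_service.create_order, order_service.cancel_order),
            (payment_service.process_payment, payment_service.refund_payment),
            (inventory_service.reserve_items, inventory_service.release_items),
            (shipping_service.schedule_shipment, shipping_service.cancel_shipment),
        ]
        completed = []
        try:
            for action, compensate in steps:
                action(order)
                completed.append(compensate)
            order_service.confirm_order(order)
        except Exception as e:
            self.rollback(order, completed, e)

    def rollback(self, order, completed, error):
        # Compensate only the steps that succeeded, most recent first.
        for compensate in reversed(completed):
            compensate(order)
```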
Pros: Centralized logic, easier to monitor. Clear transaction boundaries.
Cons: The orchestrator is a potential bottleneck, and a single point of failure unless it persists saga state so that it can be restarted or replicated.
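One common way to soften the single-point-of-failure concern is to record saga progress somewhere durable, so that a restarted orchestrator can find out which steps completed and compensate them. The sketch below illustrates the idea under simplifying assumptions: `SagaLog` is an in-memory dictionary standing in for a durable table, and `SagaLog`, `Orchestrator`, and `make_step` are hypothetical names invented for this example.

```python
class SagaLog:
    """Stand-in for a durable table keyed by saga id (in memory here)."""

    def __init__(self):
        self.completed = {}  # saga_id -> names of completed steps

    def record(self, saga_id, step_name):
        self.completed.setdefault(saga_id, []).append(step_name)

    def steps_done(self, saga_id):
        return list(self.completed.get(saga_id, []))

    def clear(self, saga_id):
        self.completed.pop(saga_id, None)


class Orchestrator:
    def __init__(self, log, steps):
        # steps: ordered list of (name, action, compensation) tuples
        self.log = log
        self.steps = steps

    def run(self, saga_id, order):
        try:
            for name, action, _ in self.steps:
                action(order)
                self.log.record(saga_id, name)
        except Exception:
            self.compensate(saga_id, order)
            return False
        self.log.clear(saga_id)
        return True

    def compensate(self, saga_id, order):
        """Undo the logged steps, newest first. Also usable after a restart."""
        done = set(self.log.steps_done(saga_id))
        for name, _, compensation in reversed(self.steps):
            if name in done:
                compensation(order)
        self.log.clear(saga_id)


calls = []  # records what the fake services were asked to do


def make_step(name, fail=False):
    def action(order):
        if fail:
            raise RuntimeError(f"{name} failed")
        calls.append(name)

    def compensation(order):
        calls.append(f"undo_{name}")

    return (name, action, compensation)


steps = [
    make_step("create_order"),
    make_step("process_payment"),
    make_step("reserve_items", fail=True),
    make_step("schedule_shipment"),
]
succeeded = Orchestrator(SagaLog(), steps).run("saga-1", {"id": 42})
print(succeeded, calls)
```

A crash between running a step and recording it would leave that step unlogged, so a production design would typically also log the intent before each step and rely on idempotent steps and compensations.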
Two-Phase Commit (2PC)
2PC is a more rigorous approach that provides atomic commit across multiple systems:
- Prepare phase: A coordinator asks all participants if they can commit. Each participant writes its changes to a durable log and responds "yes" or "no."
- Commit phase: If all participants said "yes," the coordinator tells them all to commit. If any said "no," the coordinator tells them all to abort.
2PC works well within a single trust domain (same organization, same data center). It does not work well across organizational boundaries or high-latency networks.
Limitations:
- Blocking: If the coordinator crashes after sending "prepare," participants hold locks until the coordinator recovers
- Latency: Multiple round trips across all participants
- Not suitable for long-running transactions
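The two phases can be sketched as a small in-process simulation. This is a toy model under stated assumptions: `Participant` and `two_phase_commit` are hypothetical names, and the sketch has no durable log, network, timeouts, or coordinator recovery, which are exactly the parts that make real 2PC hard.

```python
class Participant:
    """Toy participant; `can_commit` decides how it votes in the prepare phase."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"  # idle -> prepared -> committed, or aborted

    def prepare(self):
        # A real participant would write its changes to a durable log
        # and acquire locks here, before voting.
        if self.can_commit:
            self.state = "prepared"
            return True
        self.state = "aborted"
        return False

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (prepare): ask every participant to vote.
    votes = [p.prepare() for p in participants]

    # Phase 2 (commit or abort): commit only if every vote was "yes".
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


happy = [Participant("orders"), Participant("payments")]
unhappy = [Participant("orders"), Participant("inventory", can_commit=False)]
print(two_phase_commit(happy), two_phase_commit(unhappy))
```

Between `prepare()` returning and `commit()` or `abort()` being called, a participant sits in the `prepared` state. If the coordinator disappeared at that point, nothing in the participant could safely resolve the outcome, which is the blocking limitation listed above.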
Outbox Pattern
The Outbox Pattern solves the dual-write problem: a service must update its database and send a message, but the two writes go to different systems and cannot share one transaction.
Problem: Service A writes to its database and sends a Kafka message. If the database write succeeds but the message send fails, other services never learn about the change. If the message is sent but the database write fails, other services act on data that does not exist.
Solution: Instead of sending the message directly, the service writes the message to an "outbox" table in the same database transaction. A separate process reads the outbox table and publishes the messages.
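```python
import json

def create_order(order):
    # `db` is the service's database handle, defined elsewhere.
    # Both inserts commit together or not at all.
    with db.transaction():
        db.execute("INSERT INTO orders ...", order)
        db.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                   "OrderCreated", json.dumps(order))
```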
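A message relay (a CDC connector or a polling publisher) reads from the outbox and publishes to the message broker. Because the relay can crash after publishing but before recording that it did, delivery is at-least-once, so consumers should be prepared to handle duplicate messages.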
When to Use Each Pattern
| Pattern | Complexity | Consistency | Latency | Use Case |
|---------|-----------|-------------|---------|----------|
| Choreographed Saga | Medium | Eventual | Low | Simple workflows, fewer services |
| Orchestrated Saga | High | Eventual | Medium | Complex workflows, many services |
| Two-Phase Commit | High | Strong | High | Short-lived, critical transactions |
| Outbox Pattern | Low | Strong (local) | Low | Avoiding dual-write problems |
Practical Advice
- Minimize distributed transactions. Design your service boundaries so that most operations touch one service's database.
- Use sagas for business transactions that span services. Accept eventual consistency.
- Use 2PC only when strong consistency is required and latency is acceptable.
- Use the Outbox Pattern whenever a service needs to write to a database AND send a message.
- Make operations idempotent, compensating transactions included. Retries and redelivered messages can trigger the same step twice, so running it twice must have the same effect as running it once.
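The idempotency advice in the last point can be sketched with an idempotency key: each request carries a unique key, and the service ignores any key it has already applied. The `PaymentService` class, its in-memory storage, and the key format below are assumptions made for illustration only.

```python
class PaymentService:
    """Toy payment service whose operations are safe to retry."""

    def __init__(self):
        self.charged = {}       # customer -> total amount currently charged
        self.seen_keys = set()  # idempotency keys that were already applied

    def _apply_once(self, key, customer, delta):
        if key in self.seen_keys:
            return False  # duplicate request: change nothing
        self.seen_keys.add(key)
        self.charged[customer] = self.charged.get(customer, 0) + delta
        return True

    def process_payment(self, key, customer, amount):
        return self._apply_once(key, customer, amount)

    def refund_payment(self, key, customer, amount):
        return self._apply_once(key, customer, -amount)


payments = PaymentService()
payments.process_payment("order-42:charge", "alice", 50)
first_refund = payments.refund_payment("order-42:refund", "alice", 50)
retried_refund = payments.refund_payment("order-42:refund", "alice", 50)
print(first_refund, retried_refund, payments.charged["alice"])
```

In a real service the set of seen keys and the balance change would need to be written in the same local database transaction; otherwise a crash between the two writes could lose the deduplication record.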
Key Takeaways
- Distributed transactions across services cannot use a single database transaction.
- The Saga pattern manages multi-step workflows with compensating transactions.
- Choreographed sagas (events) and orchestrated sagas (central coordinator) serve different needs.
- Two-phase commit provides strong consistency but has blocking and latency issues.
- The Outbox Pattern solves the dual-write problem (database + message broker).
- Design service boundaries to minimize the need for distributed transactions.
Design principle: Accept eventual consistency for cross-service workflows. Use compensating transactions to handle failures rather than trying to prevent them.