Transaction Patterns
Master distributed transaction patterns — Two-Phase Commit, Saga (choreography & orchestration), and the Outbox pattern. Coordinate multi-service operations and handle failures gracefully.
The Big Picture — Why Distributed Transactions Are Hard
In a monolith, a single database transaction wraps everything: create order, charge payment, reduce inventory — all succeed or all roll back. In microservices, each step lives in a different service with its own database. There's no single transaction that spans all of them. If payment succeeds but inventory fails, you're in an inconsistent state.
The Trip Booking Analogy
You're booking a trip: flight, hotel, and car rental. Each is a different company. You book the flight (confirmed), book the hotel (confirmed), try to book the car — declined. Now you need to cancel the flight and hotel. But the flight has a cancellation fee, and the hotel's cancellation window just closed. Each step is independent — there's no 'undo all' button that works across companies. That's the distributed transaction problem: coordinating multiple independent services where any step can fail, and undoing previous steps isn't always straightforward.
🔥 Key Insight
ACID transactions work within a single database. Across services, you need a different approach: either coordinate all services to commit together (2PC — slow and fragile), or accept that each step commits independently and handle failures with compensating actions (Saga — flexible and scalable).
The Coordination Problem
Place an order (3 services involved):
  1. Order Service → Create order record
  2. Payment Service → Charge customer's card
  3. Inventory Service → Reserve items

Happy path: all 3 succeed → order confirmed ✅

Failure scenarios:
  Step 1 succeeds, Step 2 fails
    → Order exists but payment failed
    → Must cancel the order (compensate Step 1)
  Steps 1-2 succeed, Step 3 fails
    → Order exists, payment charged, but items unavailable
    → Must refund payment (compensate Step 2)
    → Must cancel order (compensate Step 1)
  Step 2 succeeds but Step 2's response is lost
    → Did payment go through? We don't know.
    → Retry? Might charge twice.
    → Don't retry? Might lose the charge.

This is why distributed transactions are hard.
The Monolith Way
BEGIN; INSERT order; charge_payment(); UPDATE inventory; COMMIT; — one transaction, all-or-nothing. Simple, correct, doesn't scale across services.
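The all-or-nothing behaviour of a single local transaction can be sketched as follows. This is a minimal illustration, assuming Python's built-in `sqlite3` module and hypothetical `orders` and `inventory` tables; the payment step is left out for brevity, and `place_order` is an illustrative name, not part of any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL);
    CREATE TABLE inventory (sku TEXT PRIMARY KEY, stock INTEGER CHECK (stock >= 0));
    INSERT INTO inventory VALUES ('widget', 1);
""")


def place_order(order_id: str, sku: str, qty: int, total: float) -> bool:
    """Insert the order and decrement stock in one transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
            conn.execute(
                "UPDATE inventory SET stock = stock - ? WHERE sku = ?", (qty, sku)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected negative stock; the order insert was undone too.
        return False
```

When the stock update violates the constraint, the order row inserted a moment earlier disappears with it. That is the guarantee that is lost once the two tables live in different services.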
The Microservices Way
Each service has its own DB. No shared transaction. Must coordinate through messages, events, or a coordinator. Complex, but scales independently.
Two-Phase Commit (2PC)
2PC uses a central coordinator to ensure all participants either commit or abort together. It's the closest thing to a distributed ACID transaction — but it comes with severe performance and availability costs.
Coordinator manages the transaction across 3 services:

PHASE 1: PREPARE (voting)
  Coordinator → Order Service: "Can you commit?" → "YES" ✅
  Coordinator → Payment Service: "Can you commit?" → "YES" ✅
  Coordinator → Inventory Service: "Can you commit?" → "YES" ✅
  All voted YES → proceed to Phase 2

PHASE 2: COMMIT
  Coordinator → Order Service: "COMMIT" → committed ✅
  Coordinator → Payment Service: "COMMIT" → committed ✅
  Coordinator → Inventory Service: "COMMIT" → committed ✅
  Transaction complete.

FAILURE SCENARIO:
  Phase 1: Inventory votes "NO" (out of stock)
  Coordinator → ALL services: "ABORT"
  Order Service rolls back, Payment Service rolls back.
  → All-or-nothing, like a local transaction.

COORDINATOR FAILURE:
  Coordinator crashes after Phase 1 (all voted YES) but before Phase 2
  → All services are BLOCKED. They voted YES but don't know if they should commit or abort. They hold locks and wait.
  → This is the fatal flaw of 2PC.
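The prepare and commit phases above can be sketched in a few lines. This is a simplified, in-memory illustration only: `Participant` and `two_phase_commit` are hypothetical names, and a real participant would hold database locks between `prepare` and `commit` and would persist its vote.

```python
from dataclasses import dataclass
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


@dataclass
class Participant:
    """Hypothetical in-memory participant (stands in for a service and its DB)."""
    name: str
    can_commit: bool = True
    state: str = "idle"  # idle -> prepared -> committed, or aborted

    def prepare(self) -> Vote:
        if self.can_commit:
            self.state = "prepared"  # a real participant holds locks from here on
            return Vote.YES
        self.state = "aborted"
        return Vote.NO

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1: ask every participant to vote.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if all voted YES; otherwise abort everyone.
    if all(v is Vote.YES for v in votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

The blocking problem sits between the two phases: if the coordinator stopped after collecting votes, every participant would remain in the `prepared` state with no way to decide on its own.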
Strengths
- ✅ Strong consistency: all-or-nothing across services
- ✅ ACID-like guarantees in a distributed setting
- ✅ Simple mental model (prepare → commit/abort)
- ✅ Standardized and supported by traditional databases and transaction managers (the X/Open XA interface, Oracle distributed transactions)
Why it's rarely used in microservices
- ❌ Blocking: if the coordinator fails, all participants are stuck
- ❌ Slow: two round trips + lock holding during both phases
- ❌ Not scalable: locks held across services during the entire flow
- ❌ Single point of failure: coordinator crash = system halt
- ❌ Poor fit for heterogeneous systems: every participant must support the same prepare/commit protocol (such as XA), and many NoSQL stores and message brokers do not
🎯 Interview Insight
Know 2PC to explain why it's NOT used in modern microservices. Say: "2PC provides strong consistency but it's blocking and doesn't scale. If the coordinator fails during the commit phase, all participants are stuck holding locks. In microservices, we use the Saga pattern instead — it trades strong consistency for availability and scalability."
Saga — Choreography
In choreography, there's no central coordinator. Each service listens for events and reacts. When a service completes its step, it publishes an event. The next service picks it up. If a step fails, the service publishes a failure event, and previous services execute compensating actions.
HAPPY PATH:
  Order Service: Create order → publish "OrderCreated"
    ↓
  Payment Service: hears "OrderCreated" → charge card → publish "PaymentCompleted"
    ↓
  Inventory Service: hears "PaymentCompleted" → reserve items → publish "ItemsReserved"
    ↓
  Order Service: hears "ItemsReserved" → mark order as confirmed ✅

FAILURE PATH (inventory fails):
  Order Service: Create order → publish "OrderCreated"
    ↓
  Payment Service: charge card → publish "PaymentCompleted"
    ↓
  Inventory Service: items out of stock! → publish "ReservationFailed"
    ↓
  Payment Service: hears "ReservationFailed" → REFUND payment → publish "PaymentRefunded"
    ↓
  Order Service: hears "PaymentRefunded" → CANCEL order

Each step has a compensating action:
  Create order ↔ Cancel order
  Charge payment ↔ Refund payment
  Reserve items ↔ Release reservation
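The event chain above can be sketched with a tiny in-process event bus. This is an illustration under loud assumptions: `EventBus` and `run_choreography` are hypothetical, delivery is synchronous, and each "service" is just a handler function. A real system would use a broker such as Kafka with asynchronous consumers.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical synchronous bus; records every published event type."""

    def __init__(self) -> None:
        self.handlers = defaultdict(list)
        self.log: list[str] = []

    def subscribe(self, event_type: str, handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, order_id: str) -> None:
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(order_id)


def run_choreography(in_stock: bool) -> tuple[list[str], str]:
    bus = EventBus()
    orders: dict[str, str] = {}

    # Payment Service: charge on OrderCreated, refund on ReservationFailed.
    bus.subscribe("OrderCreated", lambda oid: bus.publish("PaymentCompleted", oid))
    bus.subscribe("ReservationFailed", lambda oid: bus.publish("PaymentRefunded", oid))

    # Inventory Service: reserve items or report failure.
    def reserve(oid: str) -> None:
        bus.publish("ItemsReserved" if in_stock else "ReservationFailed", oid)

    bus.subscribe("PaymentCompleted", reserve)

    # Order Service: confirm on success, cancel once the refund is done.
    bus.subscribe("ItemsReserved", lambda oid: orders.update({oid: "confirmed"}))
    bus.subscribe("PaymentRefunded", lambda oid: orders.update({oid: "cancelled"}))

    orders["ord-123"] = "pending"
    bus.publish("OrderCreated", "ord-123")
    return bus.log, orders["ord-123"]
```

Note that no function in the sketch describes the whole flow; it only emerges from the subscriptions. That is the visibility problem choreography is known for.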
Strengths
- ✅ No central coordinator (no single point of failure)
- ✅ Highly scalable (services are independent)
- ✅ Loose coupling (services communicate via events)
- ✅ Natural fit for event-driven architectures
- ✅ Each service owns its own logic and compensation
Challenges
- ❌ Hard to understand the full flow (events scattered across services)
- ❌ Debugging is difficult (distributed tracing needed)
- ❌ Cyclic dependencies possible (service A triggers B triggers A)
- ❌ No single place to see the saga's current state
- ❌ Compensation logic can be complex and error-prone
🎯 Interview Insight
Choreography works well when services are truly independent and the flow is simple (3-4 steps). For complex flows with many branches and conditions, orchestration is easier to manage. Say: "I'd use choreography for simple, linear flows in an event-driven system. For complex workflows with conditional logic, I'd switch to orchestration."
Saga — Orchestration
In orchestration, a central orchestrator controls the saga. It tells each service what to do, waits for the result, and decides the next step. If a step fails, the orchestrator triggers compensating actions in reverse order.
Orchestrator (Order Saga):

HAPPY PATH:
  Step 1: Tell Order Service → "Create order" → OK ✅
  Step 2: Tell Payment Service → "Charge $99.99" → OK ✅
  Step 3: Tell Inventory Service → "Reserve 2 items" → OK ✅
  Step 4: Tell Order Service → "Confirm order" → DONE ✅

FAILURE PATH (Step 3 fails):
  Step 1: Tell Order Service → "Create order" → OK ✅
  Step 2: Tell Payment Service → "Charge $99.99" → OK ✅
  Step 3: Tell Inventory Service → "Reserve 2 items" → FAILED ❌
  Compensate Step 2: Tell Payment Service → "Refund $99.99" → OK
  Compensate Step 1: Tell Order Service → "Cancel order" → OK
  Saga failed. All steps compensated. System is consistent.

The orchestrator:
  → Knows the full flow (steps + compensations)
  → Tracks the current state of the saga
  → Decides what to do on success or failure
  → Is the single source of truth for the saga's progress
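A minimal orchestrator for the flow above might look like the sketch below. `Step` and `run_order_saga` are hypothetical names, actions signal failure by raising, and the sketch assumes compensations themselves always succeed, which a production orchestrator could not take for granted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]        # raises an exception on failure
    compensation: Callable[[], None]  # undoes a completed action


def run_order_saga(steps: list[Step]) -> tuple[str, list[str]]:
    history: list[str] = []
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            history.append(f"{step.name}: failed")
            # Undo every completed step, most recent first.
            for done in reversed(completed):
                done.compensation()
                history.append(f"{done.name}: compensated")
            return "compensated", history
        completed.append(step)
        history.append(f"{step.name}: ok")
    return "completed", history
```

The `history` list plays the role of the saga state: the flow and its progress are visible in one place, which is the main argument for orchestration.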
Strengths
- ✅ Clear, centralized flow control (easy to understand)
- ✅ Saga state is tracked in one place (debuggable)
- ✅ Complex conditional logic is straightforward
- ✅ Easier to add new steps or change the flow
- ✅ Better for workflows with many branches
Challenges
- ❌ Orchestrator is a single point of control (not a single point of failure, since it can be replicated)
- ❌ Tighter coupling (orchestrator knows about all services)
- ❌ Orchestrator can become a bottleneck at extreme scale
- ❌ Risk of the orchestrator becoming a 'god service'
- ❌ Must persist saga state for crash recovery
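To illustrate the last point, here is a rough sketch of checkpointing saga progress so that a restarted orchestrator can pick up where it left off. Everything here is an assumption for illustration: `SagaStore` uses a dict of JSON strings in place of a real database table, and `run_or_resume` and its `crash_before` parameter exist only to simulate a crash.

```python
import json

STEPS = ["create_order", "charge_payment", "reserve_items", "confirm_order"]


class SagaStore:
    """Hypothetical durable store; a dict stands in for a database table."""

    def __init__(self) -> None:
        self.rows: dict[str, str] = {}

    def save(self, saga_id: str, state: dict) -> None:
        self.rows[saga_id] = json.dumps(state)

    def load(self, saga_id: str) -> dict:
        raw = self.rows.get(saga_id)
        return json.loads(raw) if raw else {"next_step": 0}


def run_or_resume(saga_id, store, execute, crash_before=None):
    """Run the saga from its last checkpoint; `execute` performs one named step."""
    state = store.load(saga_id)
    while state["next_step"] < len(STEPS):
        step = STEPS[state["next_step"]]
        if step == crash_before:
            raise RuntimeError("simulated orchestrator crash")
        execute(step)
        state["next_step"] += 1
        store.save(saga_id, state)  # checkpoint after every completed step
    return state
```

A crash between `execute(step)` and `store.save(...)` would cause that step to run again on resume, which is one reason the services behind each step need to be idempotent.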
| Dimension | Choreography | Orchestration |
|---|---|---|
| Control | Decentralized (events) | Centralized (orchestrator) |
| Coupling | Loose (services don't know each other) | Tighter (orchestrator knows all services) |
| Visibility | Hard to see full flow | Full flow visible in orchestrator |
| Debugging | Difficult (distributed events) | Easier (centralized state) |
| Scalability | Higher (no central bottleneck) | Slightly lower (orchestrator) |
| Complexity | Grows with number of services | Contained in orchestrator |
| Best for | Simple flows, event-driven systems | Complex flows, many conditions |
🎯 Interview Insight
Orchestration is a common choice for complex sagas in production because it is easier to reason about, debug, and modify. Tools like Temporal, AWS Step Functions, and Cadence are orchestration engines. Say: "For a 3-step linear flow, choreography is fine. For an order flow with payment retries, partial fulfillment, and conditional shipping, I'd use an orchestrator."
Outbox Pattern
The Outbox pattern solves the dual-write problem: how do you reliably write to a database AND publish an event? If you write to the DB and then publish to Kafka, the publish might fail — the DB has the data but no event was sent. If you publish first and then write to the DB, the DB write might fail — the event was sent but the data doesn't exist.
THE PROBLEM:
  // Step 1: Write to database
  db.insert("orders", order); // ✅ succeeds
  // Step 2: Publish event to Kafka
  kafka.publish("OrderCreated", order); // ❌ fails (network error)
  // Result: order exists in DB but no event was published
  // Payment service never hears about the order
  // Customer is confused: "I placed an order but nothing happened"

Reversing the order doesn't help:
  kafka.publish("OrderCreated", order); // ✅ succeeds
  db.insert("orders", order); // ❌ fails
  // Event published but order doesn't exist in DB!

THE SOLUTION: OUTBOX PATTERN
  // Single database transaction:
  BEGIN;
  INSERT INTO orders (...) VALUES (...);
  INSERT INTO outbox (event_type, payload) VALUES ('OrderCreated', ...);
  COMMIT;
  // Both succeed or both fail: atomic!

  // Background worker (separate process):
  // Reads outbox table → publishes to Kafka → marks as published
  SELECT * FROM outbox WHERE published = false;
  kafka.publish(event);
  UPDATE outbox SET published = true WHERE id = ...;
1. Application writes:
   BEGIN TRANSACTION;
   INSERT INTO orders (id, user_id, total) VALUES ('ord-123', 42, 99.99);
   INSERT INTO outbox (id, event_type, payload, published) VALUES ('evt-456', 'OrderCreated', '{"orderId":"ord-123",...}', false);
   COMMIT;
   → Both writes are atomic. If either fails, both roll back.

2. Outbox relay (background worker):
   Every 100ms:
     SELECT * FROM outbox WHERE published = false ORDER BY created_at LIMIT 100;
   For each event:
     kafka.publish(event.event_type, event.payload);
     UPDATE outbox SET published = true WHERE id = event.id;

3. Consumers receive the event:
   Payment Service → hears "OrderCreated" → charges card
   Inventory Service → hears "OrderCreated" → reserves items

4. Cleanup:
   DELETE FROM outbox WHERE published = true AND created_at < NOW() - INTERVAL '7 days';
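The write and relay steps above can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under assumptions, not a production relay: the schema uses an auto-incrementing integer `id` instead of string event IDs, `publish` is a caller-supplied stand-in for a broker client, and `create_order` and `relay_once` are hypothetical names.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, user_id INTEGER, total REAL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def create_order(order_id: str, user_id: int, total: float) -> None:
    # One transaction: the order row and its event commit or roll back together.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, user_id, total))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"orderId": order_id})),
        )


def relay_once(publish, batch_size: int = 100) -> int:
    """Publish one batch of unpublished events; returns how many rows were read."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_type, payload)  # if this raises, the row stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
    return len(rows)
```

If the process dies after `publish` returns but before the `UPDATE` commits, the next pass publishes the same event again. Delivery is therefore at-least-once, and downstream consumers need to tolerate duplicates.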
Strengths
- ✅ Solves the dual-write problem (DB + event are atomic)
- ✅ No data loss: the event is guaranteed to be published eventually
- ✅ Works with any message broker (Kafka, SQS, RabbitMQ)
- ✅ Simple to implement (outbox table + polling worker)
- ✅ Standard pattern in microservices (Debezium can replace the worker)
Trade-offs
- ❌ Event delivery is delayed by the polling interval (often tens to hundreds of milliseconds)
- ❌ Extra table and background worker to maintain
- ❌ Outbox table grows and needs cleanup
- ❌ Delivery is at-least-once: the relay might publish the same event twice, so consumers must be idempotent
- ❌ Adds complexity vs direct publish (justified for reliability)
🎯 Interview Insight
The Outbox pattern is one of the most important patterns in microservices. Whenever you say "write to DB and publish an event," the interviewer is waiting for you to address the dual-write problem. Say: "I'd use the Outbox pattern: write the event to an outbox table in the same transaction as the business data. A background relay publishes events from the outbox to Kafka. This guarantees the event is published, at least once, if and only if the data was written."
End-to-End Scenario
Let's design the distributed transaction layer for an e-commerce order system.
🛒 Order System — 3 Services
Services: Order, Payment, Inventory. Each has its own database.
Requirement: if any step fails, previous steps must be compensated.
Saga Orchestrator manages the flow
An Order Saga orchestrator (could be a Temporal workflow or a state machine) controls the sequence: create order → charge payment → reserve inventory → confirm. On failure at any step, it triggers compensations in reverse.
Outbox pattern for reliable event publishing
Each service uses the Outbox pattern. When the Order Service creates an order, it writes the order AND an 'OrderCreated' event to its outbox in one transaction. The relay publishes to Kafka. No dual-write risk.
Compensating actions for failures
If Inventory fails (out of stock): orchestrator tells Payment to refund, then tells Order to cancel. Each compensation is idempotent — calling 'refund' twice doesn't double-refund. The orchestrator tracks which steps completed and which need compensation.
Idempotent services for retry safety
Every service operation is idempotent. If the orchestrator retries 'charge payment' (because the response was lost), the Payment service checks: 'Did I already charge for order-123?' If yes → return success without charging again.
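The "did I already charge for this order?" check can be sketched as below. This is an in-memory illustration with hypothetical names (`PaymentService`, `card_calls`); a real service would keep these records in its own database and guard the check-and-write with a unique constraint or a transaction.

```python
class PaymentService:
    """Hypothetical payment service that treats the order ID as an idempotency key."""

    def __init__(self) -> None:
        self.charges: dict[str, float] = {}  # order_id -> amount charged
        self.refunded: set[str] = set()      # order_ids already refunded
        self.card_calls = 0                  # times the card was actually charged

    def charge(self, order_id: str, amount: float) -> str:
        if order_id in self.charges:
            return "already_charged"  # retry: report success without charging again
        self.card_calls += 1
        self.charges[order_id] = amount
        return "charged"

    def refund(self, order_id: str) -> str:
        if order_id not in self.charges:
            return "nothing_to_refund"
        if order_id in self.refunded:
            return "already_refunded"  # duplicate compensation is a no-op
        self.refunded.add(order_id)
        return "refunded"
```

With this shape, the orchestrator can retry a charge or a refund after a lost response without needing to know whether the first attempt went through.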
Order Saga Orchestrator:
│
├── Step 1: Order Service → Create order
│   └── Outbox → "OrderCreated" → Kafka
│
├── Step 2: Payment Service → Charge card
│   └── Outbox → "PaymentCompleted" → Kafka
│
├── Step 3: Inventory Service → Reserve items
│   └── Outbox → "ItemsReserved" → Kafka
│
└── Step 4: Order Service → Confirm order

Failure at Step 3:
  Orchestrator → Payment Service: "Refund" (compensate Step 2)
  Orchestrator → Order Service: "Cancel" (compensate Step 1)

Each service:
  → Uses Outbox pattern (atomic DB write + event)
  → Is idempotent (safe to retry)
  → Has compensating actions defined
Trade-offs & Decision Making
| Pattern | Consistency | Scalability | Complexity | Blocking | Best For |
|---|---|---|---|---|---|
| 2PC | Strong (ACID-like) | Low | High | Yes (locks held) | Traditional distributed DBs |
| Saga (Choreography) | Eventual | High | High (distributed events) | No | Simple event-driven flows |
| Saga (Orchestration) | Eventual | Medium-High | Medium | No | Complex multi-step workflows |
| Outbox Pattern | Reliable delivery | High | Medium | No | Any service publishing events |
When to Use What
| Scenario | Pattern | Why |
|---|---|---|
| Simple 3-step flow, event-driven system | Saga Choreography | Services react to events, no coordinator needed |
| Complex order flow with retries and conditions | Saga Orchestration | Orchestrator manages the complexity centrally |
| Need ACID across 2 databases (same team) | 2PC | Strong consistency, acceptable if both DBs are local |
| Any service that writes to DB + publishes events | Outbox Pattern | Solves dual-write, use alongside any saga pattern |
| Microservices with independent teams | Saga + Outbox | Each team owns their service, events for coordination |
🎯 Decision Framework
Default to Saga (orchestration) + Outbox for microservices. Orchestration gives you visibility and control. Outbox gives you reliable event publishing. 2PC is for traditional distributed databases, not microservices. Choreography is for simple, linear flows where orchestration overhead isn't justified.
Interview Questions
Q: Why is 2PC not preferred in microservices?
A: 2PC is blocking: if the coordinator crashes after the prepare phase, all participants hold locks and wait indefinitely. It's slow (two round trips + lock holding). It doesn't scale (locks across services during the entire flow). It's a single point of failure. In microservices, services are independently deployed and scaled — 2PC couples them tightly. The Saga pattern provides eventual consistency without blocking, which is a better fit for microservices.
Q: Saga choreography vs orchestration: when to use each?
A: Choreography: services communicate via events, no central coordinator. Best for simple, linear flows (3-4 steps) in event-driven systems. Pros: decentralized, scalable. Cons: hard to debug, no single view of the saga. Orchestration: a central orchestrator controls the flow. Best for complex workflows with conditions, retries, and many steps. Pros: clear flow, easy to debug, centralized state. Cons: orchestrator is a point of control. Most production systems use orchestration for anything beyond trivial flows.
Q: What is the Outbox pattern and why is it important?
A: The Outbox pattern solves the dual-write problem: writing to a database AND publishing an event must both succeed or both fail. Solution: write the business data and the event to the same database in one transaction (the event goes to an 'outbox' table). A background worker reads the outbox and publishes events to Kafka. If the worker crashes, it retries — events are never lost. This is the standard way to publish events reliably in microservices.
Payment succeeds but inventory reservation fails
How do you handle this in a saga?
Answer: The saga triggers compensating actions in reverse order. The orchestrator (or the choreography event chain) tells the Payment service to refund the charge. The Order service cancels the order. Each compensation is idempotent — if the refund command is sent twice (retry), the Payment service checks if it already refunded and skips the duplicate. The customer sees: 'Sorry, items are out of stock. Your payment has been refunded.' The system is eventually consistent.
You need to create an order in the DB and notify the payment service
How do you ensure both happen reliably?
Answer: Outbox pattern. In one database transaction: INSERT the order AND INSERT an 'OrderCreated' event into the outbox table. A background relay reads unpublished events from the outbox and publishes them to Kafka. The Payment service consumes the event. If the relay crashes, it retries on restart — the event is still in the outbox. If Kafka is down, the relay retries until it succeeds. The event is published if and only if the order was created.
Pitfalls
Using 2PC in microservices
Applying Two-Phase Commit across independently deployed microservices. This couples all services to the coordinator's availability, introduces blocking, and doesn't scale. If the coordinator is down, the entire system halts.
✅ Use the Saga pattern instead. Accept eventual consistency. Design compensating actions for each step. Use the Outbox pattern for reliable event publishing. 2PC is for tightly coupled distributed databases, not loosely coupled microservices.
Not implementing compensation logic
Building the happy path (create order → charge → reserve) but not the failure path. When inventory fails, the payment is never refunded. The customer is charged for items they can't receive.
✅ For every forward action, define a compensating action. Create order ↔ Cancel order. Charge payment ↔ Refund payment. Reserve items ↔ Release reservation. Test the failure paths as thoroughly as the happy path. Compensation logic is not optional: it's half the saga.
Ignoring eventual consistency
Expecting immediate consistency across services after a saga step. A user places an order and immediately checks their order history — but the read model hasn't been updated yet. 'Where's my order?'
✅ Design the UI for eventual consistency. Show 'Order processing...' immediately. Update the read model asynchronously. Use read-your-own-writes: after placing an order, route that user's reads to the write service briefly. Set expectations: the order confirmation page shows the order even before all services have processed it.
The dual-write problem
Writing to the database and then publishing an event as two separate operations. If the publish fails, the event is lost. If you publish first and the DB write fails, the event describes data that doesn't exist. This is the most common reliability bug in microservices.
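✅ Use the Outbox pattern. Write the business data and the event to the same database in one transaction. A background relay publishes events from the outbox. Alternatively, use Change Data Capture (for example Debezium) to stream committed changes from the database log to Kafka, either by tailing the outbox table in place of the polling relay or by capturing changes to the business tables directly, in which case no outbox table is needed.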