
Transaction Patterns

Master distributed transaction patterns — Two-Phase Commit, Saga (choreography & orchestration), and the Outbox pattern. Coordinate multi-service operations and handle failures gracefully.

01

The Big Picture — Why Distributed Transactions Are Hard

In a monolith, a single database transaction wraps everything: create order, charge payment, reduce inventory — all succeed or all roll back. In microservices, each step lives in a different service with its own database. There's no single transaction that spans all of them. If payment succeeds but inventory fails, you're in an inconsistent state.

✈️

The Trip Booking Analogy

You're booking a trip: flight, hotel, and car rental. Each is a different company. You book the flight (confirmed), book the hotel (confirmed), try to book the car — declined. Now you need to cancel the flight and hotel. But the flight has a cancellation fee, and the hotel's cancellation window just closed. Each step is independent — there's no 'undo all' button that works across companies. That's the distributed transaction problem: coordinating multiple independent services where any step can fail, and undoing previous steps isn't always straightforward.

🔥 Key Insight

ACID transactions work within a single database. Across services, you need a different approach: either coordinate all services to commit together (2PC — slow and fragile), or accept that each step commits independently and handle failures with compensating actions (Saga — flexible and scalable).

02

The Coordination Problem

The Problem: Multi-Service Operation
Place an order (3 services involved):

  1. Order Service → Create order record
  2. Payment Service → Charge customer's card
  3. Inventory Service → Reserve items

Happy path: all 3 succeed → order confirmed

Failure scenarios:
  Step 1 succeeds, Step 2 fails
    → Order exists but payment failed
    → Must cancel the order (compensate Step 1)

  Steps 1-2 succeed, Step 3 fails
    → Order exists, payment charged, but items unavailable
    → Must refund payment (compensate Step 2)
    → Must cancel order (compensate Step 1)

  Step 2 succeeds but Step 2's response is lost
    → Did payment go through? We don't know.
    → Retry? Might charge twice.
    → Don't retry? Might lose the charge.

This is why distributed transactions are hard.
🏢

The Monolith Way

BEGIN; INSERT order; charge_payment(); UPDATE inventory; COMMIT; — one transaction, all-or-nothing. Simple, correct, doesn't scale across services.

🔧

The Microservices Way

Each service has its own DB. No shared transaction. Must coordinate through messages, events, or a coordinator. Complex, but scales independently.

03

Two-Phase Commit (2PC)

2PC uses a central coordinator to ensure all participants either commit or abort together. It's the closest thing to a distributed ACID transaction — but it comes with severe performance and availability costs.

2PC: How It Works
Coordinator manages the transaction across 3 services:

PHASE 1 - PREPARE (voting):
  Coordinator → Order Service:     "Can you commit?" → "YES"
  Coordinator → Payment Service:   "Can you commit?" → "YES"
  Coordinator → Inventory Service: "Can you commit?" → "YES"

  All voted YES → proceed to Phase 2

PHASE 2 - COMMIT:
  Coordinator → Order Service:     "COMMIT" → committed
  Coordinator → Payment Service:   "COMMIT" → committed
  Coordinator → Inventory Service: "COMMIT" → committed

  Transaction complete.

FAILURE SCENARIO:
  Phase 1: Inventory votes "NO" (out of stock)
  Coordinator → ALL services: "ABORT"
  Order Service rolls back, Payment Service rolls back.
  → All-or-nothing, like a local transaction.

COORDINATOR FAILURE:
  Coordinator crashes after Phase 1 (all voted YES) but before Phase 2
  → All services are BLOCKED. They voted YES but don't know if they
     should commit or abort. They hold locks and wait.
  → This is the fatal flaw of 2PC.
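
The prepare/commit flow above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production coordinator: the `Participant` class and the `two_phase_commit` function are hypothetical names introduced here, and the sketch assumes no crashes, no timeouts, and no durable log (which is exactly where the blocking problem lives in real systems).

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical in-memory participant; a real one would hold DB locks."""
    name: str
    can_commit: bool = True
    state: str = "idle"  # idle -> prepared -> committed, or aborted

    def prepare(self) -> bool:
        # Phase 1: vote. A YES vote is a promise to commit if asked later.
        if self.can_commit:
            self.state = "prepared"
            return True
        self.state = "aborted"
        return False

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> bool:
    """Return True if every participant committed, False if all aborted."""
    # Phase 1 (PREPARE): collect a vote from every participant.
    votes = [p.prepare() for p in participants]

    if all(votes):
        # Phase 2 (COMMIT): everyone voted YES.
        for p in participants:
            p.commit()
        return True

    # At least one NO vote: tell everyone to abort.
    for p in participants:
        p.abort()
    return False
```

If the coordinator process died between the two phases, every participant would be left in the `prepared` state with no way to decide on its own, which is the blocking scenario described above.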

Strengths

  • Strong consistency — all-or-nothing across services
  • ACID-like guarantees in a distributed setting
  • Simple mental model (prepare → commit/abort)
  • Used for traditional distributed transactions (for example, the XA standard supported by many relational databases)

Why it's rarely used in microservices

  • Blocking: if coordinator fails, all participants are stuck
  • Slow: two round trips + lock holding during both phases
  • Not scalable: locks held across services during the entire flow
  • Single point of failure: coordinator crash = system halt
  • Doesn't work well with heterogeneous systems (different DBs)

🎯 Interview Insight

Know 2PC to explain why it's NOT used in modern microservices. Say: "2PC provides strong consistency but it's blocking and doesn't scale. If the coordinator fails during the commit phase, all participants are stuck holding locks. In microservices, we use the Saga pattern instead — it trades strong consistency for availability and scalability."

04

Saga — Choreography

In choreography, there's no central coordinator. Each service listens for events and reacts. When a service completes its step, it publishes an event. The next service picks it up. If a step fails, the service publishes a failure event, and previous services execute compensating actions.

Saga Choreography: Event-Driven Flow
HAPPY PATH:
  Order Service: Create order → publish "OrderCreated"

  Payment Service: hears "OrderCreated" → charge card
    → publish "PaymentCompleted"

  Inventory Service: hears "PaymentCompleted" → reserve items
    → publish "ItemsReserved"

  Order Service: hears "ItemsReserved" → mark order as confirmed

FAILURE PATH (inventory fails):
  Order Service: Create order → publish "OrderCreated"

  Payment Service: charge card → publish "PaymentCompleted"

  Inventory Service: items out of stock!
    → publish "ReservationFailed"

  Payment Service: hears "ReservationFailed" → REFUND payment
    → publish "PaymentRefunded"

  Order Service: hears "PaymentRefunded" → CANCEL order

Each step has a compensating action:
  Create order → Cancel order
  Charge payment → Refund payment
  Reserve items → Release reservation
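
As a rough illustration of the event chain above, the following sketch wires the three services to a synchronous in-memory bus. `EventBus`, `wire_services`, and `place_order` are hypothetical names for this example; a real system would use a broker such as Kafka and each handler would live in its own service.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Hypothetical synchronous in-memory bus standing in for a broker."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.log: list[str] = []  # event types in publication order

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.log.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


def wire_services(bus: EventBus, orders: dict, in_stock: bool) -> None:
    """Register each service's reactions; no service calls another directly."""

    def payment_on_order_created(evt: dict) -> None:
        bus.publish("PaymentCompleted", evt)  # charge card (elided)

    def inventory_on_payment_completed(evt: dict) -> None:
        bus.publish("ItemsReserved" if in_stock else "ReservationFailed", evt)

    def order_on_items_reserved(evt: dict) -> None:
        orders[evt["order_id"]] = "confirmed"

    def payment_on_reservation_failed(evt: dict) -> None:
        bus.publish("PaymentRefunded", evt)  # compensating action: refund

    def order_on_payment_refunded(evt: dict) -> None:
        orders[evt["order_id"]] = "cancelled"  # compensating action: cancel

    bus.subscribe("OrderCreated", payment_on_order_created)
    bus.subscribe("PaymentCompleted", inventory_on_payment_completed)
    bus.subscribe("ItemsReserved", order_on_items_reserved)
    bus.subscribe("ReservationFailed", payment_on_reservation_failed)
    bus.subscribe("PaymentRefunded", order_on_payment_refunded)


def place_order(bus: EventBus, orders: dict, order_id: str) -> None:
    orders[order_id] = "pending"
    bus.publish("OrderCreated", {"order_id": order_id})
```

Note that the overall flow is visible only by reading every subscription, which mirrors the visibility challenge of choreography.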

Strengths

  • No central coordinator (no single point of failure)
  • Highly scalable (services are independent)
  • Loose coupling (services communicate via events)
  • Natural fit for event-driven architectures
  • Each service owns its own logic and compensation

Challenges

  • Hard to understand the full flow (events scattered across services)
  • Debugging is difficult (distributed tracing needed)
  • Cyclic dependencies possible (service A triggers B triggers A)
  • No single place to see the saga's current state
  • Compensation logic can be complex and error-prone

🎯 Interview Insight

Choreography works well when services are truly independent and the flow is simple (3-4 steps). For complex flows with many branches and conditions, orchestration is easier to manage. Say: "I'd use choreography for simple, linear flows in an event-driven system. For complex workflows with conditional logic, I'd switch to orchestration."

05

Saga — Orchestration

In orchestration, a central orchestrator controls the saga. It tells each service what to do, waits for the result, and decides the next step. If a step fails, the orchestrator triggers compensating actions in reverse order.

Saga Orchestration: Centrally Controlled
Orchestrator (Order Saga):

HAPPY PATH:
  Step 1: Tell Order Service → "Create order" → OK
  Step 2: Tell Payment Service → "Charge $99.99" → OK
  Step 3: Tell Inventory Service → "Reserve 2 items" → OK
  Step 4: Tell Order Service → "Confirm order" → DONE

FAILURE PATH (Step 3 fails):
  Step 1: Tell Order Service → "Create order" → OK
  Step 2: Tell Payment Service → "Charge $99.99" → OK
  Step 3: Tell Inventory Service → "Reserve 2 items" → FAILED

  Compensate Step 2: Tell Payment Service → "Refund $99.99" → OK
  Compensate Step 1: Tell Order Service → "Cancel order" → OK

  Saga failed. All steps compensated. System is consistent.

The orchestrator:
  → Knows the full flow (steps + compensations)
  → Tracks the current state of the saga
  → Decides what to do on success or failure
  → Is the single source of truth for the saga's progress
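
A minimal sketch of such an orchestrator, assuming each step is a plain Python callable that raises on failure. `Step` and `run_saga` are hypothetical names for this example; workflow engines add persistence, retries, and timeouts that are omitted here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]        # forward action; raises on failure
    compensation: Callable[[], None]  # undoes a completed action


def run_saga(steps: list[Step]) -> tuple[bool, list[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns (succeeded, history), where history records what happened.
    """
    completed: list[Step] = []
    history: list[str] = []

    for step in steps:
        try:
            step.action()
        except Exception:
            history.append(f"{step.name}: FAILED")
            # Undo only the steps that actually completed, newest first.
            for done in reversed(completed):
                done.compensation()
                history.append(f"compensate {done.name}")
            return False, history
        completed.append(step)
        history.append(f"{step.name}: OK")

    return True, history
```

In this sketch the `history` list lives in memory; for crash recovery it would need to be persisted after every step so a restarted orchestrator can resume or compensate.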

Strengths

  • Clear, centralized flow control (easy to understand)
  • Saga state is tracked in one place (debuggable)
  • Complex conditional logic is straightforward
  • Easier to add new steps or change the flow
  • Better for workflows with many branches

Challenges

  • Orchestrator is a single point of control (not failure — it can be replicated)
  • Tighter coupling (orchestrator knows about all services)
  • Orchestrator can become a bottleneck at extreme scale
  • Risk of the orchestrator becoming a 'god service'
  • Must persist saga state for crash recovery

Dimension | Choreography | Orchestration
Control | Decentralized (events) | Centralized (orchestrator)
Coupling | Loose (services don't know each other) | Tighter (orchestrator knows all services)
Visibility | Hard to see full flow | Full flow visible in orchestrator
Debugging | Difficult (distributed events) | Easier (centralized state)
Scalability | Higher (no central bottleneck) | Slightly lower (orchestrator)
Complexity | Grows with number of services | Contained in orchestrator
Best for | Simple flows, event-driven systems | Complex flows, many conditions

🎯 Interview Insight

Most production systems use orchestration for complex sagas. It's easier to reason about, debug, and modify. Tools like Temporal, AWS Step Functions, and Cadence are orchestration engines. Say: "For a 3-step linear flow, choreography is fine. For an order flow with payment retries, partial fulfillment, and conditional shipping, I'd use an orchestrator."

06

Outbox Pattern

The Outbox pattern solves the dual-write problem: how do you reliably write to a database AND publish an event? If you write to the DB and then publish to Kafka, the publish might fail — the DB has the data but no event was sent. If you publish first and then write to the DB, the DB write might fail — the event was sent but the data doesn't exist.

The Dual-Write Problem
THE PROBLEM:
  // Step 1: Write to database
  db.insert("orders", order);  // ✅ succeeds

  // Step 2: Publish event to Kafka
  kafka.publish("OrderCreated", order);  // ❌ fails (network error)

  // Result: order exists in DB but no event was published
  // Payment service never hears about the order
  // Customer is confused: "I placed an order but nothing happened"

  Reversing the order doesn't help:
  kafka.publish("OrderCreated", order);  // ✅ succeeds
  db.insert("orders", order);            // ❌ fails
  // Event published but order doesn't exist in DB!

THE SOLUTION - OUTBOX PATTERN:
  // Single database transaction:
  BEGIN;
    INSERT INTO orders (...) VALUES (...);
    INSERT INTO outbox (event_type, payload) VALUES ('OrderCreated', ...);
  COMMIT;
  // Both succeed or both fail: atomic!

  // Background worker (separate process):
  // Reads outbox table → publishes to Kafka → marks as published
  SELECT * FROM outbox WHERE published = false;
  kafka.publish(event);
  UPDATE outbox SET published = true WHERE id = ...;

Outbox Pattern: Full Flow
1. Application writes:
   BEGIN TRANSACTION;
     INSERT INTO orders (id, user_id, total) VALUES ('ord-123', 42, 99.99);
     INSERT INTO outbox (id, event_type, payload, published)
       VALUES ('evt-456', 'OrderCreated', '{"orderId":"ord-123",...}', false);
   COMMIT;
   → Both writes are atomic. If either fails, both roll back.

2. Outbox relay (background worker):
   Every 100ms:
     SELECT * FROM outbox WHERE published = false ORDER BY created_at LIMIT 100;
     For each event:
       kafka.publish(event.event_type, event.payload);
       UPDATE outbox SET published = true WHERE id = event.id;

3. Consumers receive the event:
   Payment Service → hears "OrderCreated" → charges card
   Inventory Service → hears "OrderCreated" → reserves items

4. Cleanup:
   DELETE FROM outbox WHERE published = true AND created_at < NOW() - INTERVAL '7 days';
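
The write and relay steps above can be tried locally with Python's built-in `sqlite3` module. This is a sketch under simplifying assumptions: SQLite stands in for the service's database, the `publish` callable stands in for a Kafka producer, and the function names (`init_db`, `create_order`, `relay_once`) are made up for this example.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, user_id INTEGER, total REAL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def create_order(conn: sqlite3.Connection, order_id: str, user_id: int, total: float) -> None:
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the order row and the outbox row are written atomically.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, user_id, total) VALUES (?, ?, ?)",
            (order_id, user_id, total),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"orderId": order_id})),
        )


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, str], None],
               batch_size: int = 100) -> int:
    """Publish one batch of unpublished events; return how many were sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for event_id, event_type, payload in rows:
        # If publish raises, the row stays unpublished and is retried later.
        publish(event_type, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
        sent += 1
    return sent
```

A crash between `publish` and the `UPDATE` would cause the same event to be sent again on the next run, so this relay gives at-least-once delivery rather than exactly-once.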

Strengths

  • Solves the dual-write problem (DB + event are atomic)
  • No data loss — event is guaranteed to be published eventually
  • Works with any message broker (Kafka, SQS, RabbitMQ)
  • Simple to implement (outbox table + polling worker)
  • Standard pattern in microservices (Debezium can replace the worker)

Trade-offs

  • Event delivery is delayed (polling interval, typically 50-200ms)
  • Extra table and background worker to maintain
  • Outbox table grows and needs cleanup
  • Consumers must be idempotent (the relay delivers at least once, so it might publish the same event twice)
  • Adds complexity vs direct publish (justified for reliability)
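
Because the relay can publish the same event more than once, the receiving side usually deduplicates. A minimal sketch, assuming every event carries a unique `event_id` field (an assumption of this example, as is the `PaymentConsumer` class itself):

```python
class PaymentConsumer:
    """Hypothetical consumer that ignores events it has already handled."""

    def __init__(self) -> None:
        self.processed_ids: set[str] = set()
        self.charged_orders: list[str] = []

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self.processed_ids:
            return False  # duplicate delivery: do nothing
        self.charged_orders.append(event["order_id"])  # the side effect
        self.processed_ids.add(event_id)
        return True
```

In a real service the processed-ID record and the side effect would be stored in the same database transaction, so a crash cannot leave one without the other.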

🎯 Interview Insight

The Outbox pattern is one of the most important patterns in microservices. Whenever you say "write to DB and publish an event," the interviewer is waiting for you to address the dual-write problem. Say: "I'd use the Outbox pattern — write the event to an outbox table in the same transaction as the business data. A background relay publishes events from the outbox to Kafka. This guarantees the event is published if and only if the data was written."

07

End-to-End Scenario

Let's design the distributed transaction layer for an e-commerce order system.

🛒 Order System — 3 Services

Services: Order, Payment, Inventory. Each has its own database.

Requirement: if any step fails, previous steps must be compensated.

1

Saga Orchestrator manages the flow

An Order Saga orchestrator (could be a Temporal workflow or a state machine) controls the sequence: create order → charge payment → reserve inventory → confirm. On failure at any step, it triggers compensations in reverse.

2

Outbox pattern for reliable event publishing

Each service uses the Outbox pattern. When the Order Service creates an order, it writes the order AND an 'OrderCreated' event to its outbox in one transaction. The relay publishes to Kafka. No dual-write risk.

3

Compensating actions for failures

If Inventory fails (out of stock): orchestrator tells Payment to refund, then tells Order to cancel. Each compensation is idempotent — calling 'refund' twice doesn't double-refund. The orchestrator tracks which steps completed and which need compensation.

4

Idempotent services for retry safety

Every service operation is idempotent. If the orchestrator retries 'charge payment' (because the response was lost), the Payment service checks: 'Did I already charge for order-123?' If yes → return success without charging again.
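
One way to sketch this check is to key the stored result on the order ID and return it on any repeat call. The `PaymentService` class and `charge_with_retry` helper below are hypothetical and in-memory only; amounts are in cents to avoid floating-point money.

```python
import itertools
from typing import Callable


class PaymentService:
    """Hypothetical service that charges each order at most once."""

    def __init__(self) -> None:
        self.receipts: dict[str, dict] = {}  # order_id -> stored receipt
        self._ids = itertools.count(1)

    def charge(self, order_id: str, amount_cents: int) -> dict:
        existing = self.receipts.get(order_id)
        if existing is not None:
            return existing  # already charged: return the same receipt
        receipt = {
            "charge_id": f"ch-{next(self._ids)}",
            "order_id": order_id,
            "amount_cents": amount_cents,
        }
        self.receipts[order_id] = receipt
        return receipt


def charge_with_retry(call: Callable[[str, int], dict], order_id: str,
                      amount_cents: int, attempts: int = 3) -> dict:
    """Retry on TimeoutError; safe only because the callee is idempotent."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception | None = None
    for _ in range(attempts):
        try:
            return call(order_id, amount_cents)
        except TimeoutError as exc:
            last_error = exc
    assert last_error is not None
    raise last_error
```

With this shape, a lost response followed by a retry yields the original receipt instead of a second charge.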

Architecture: Saga + Outbox
Order Saga Orchestrator:

  ├── Step 1: Order Service → Create order
  │   └── Outbox → "OrderCreated" → Kafka

  ├── Step 2: Payment Service → Charge card
  │   └── Outbox → "PaymentCompleted" → Kafka

  ├── Step 3: Inventory Service → Reserve items
  │   └── Outbox → "ItemsReserved" → Kafka

  └── Step 4: Order Service → Confirm order

Failure at Step 3:
  Orchestrator → Payment Service: "Refund" (compensate Step 2)
  Orchestrator → Order Service: "Cancel" (compensate Step 1)

Each service:
  → Uses Outbox pattern (atomic DB write + event)
  → Is idempotent (safe to retry)
  → Has compensating actions defined

08

Trade-offs & Decision Making

Pattern | Consistency | Scalability | Complexity | Blocking | Best For
2PC | Strong (ACID-like) | Low | High | Yes (locks held) | Traditional distributed DBs
Saga (Choreography) | Eventual | High | High (distributed events) | No | Simple event-driven flows
Saga (Orchestration) | Eventual | Medium-High | Medium | No | Complex multi-step workflows
Outbox Pattern | Reliable delivery | High | Medium | No | Any service publishing events

When to Use What

Scenario | Pattern | Why
Simple 3-step flow, event-driven system | Saga Choreography | Services react to events, no coordinator needed
Complex order flow with retries and conditions | Saga Orchestration | Orchestrator manages the complexity centrally
Need ACID across 2 databases (same team) | 2PC | Strong consistency, acceptable if both DBs are local
Any service that writes to DB + publishes events | Outbox Pattern | Solves dual-write, use alongside any saga pattern
Microservices with independent teams | Saga + Outbox | Each team owns their service, events for coordination

🎯 Decision Framework

Default to Saga (orchestration) + Outbox for microservices. Orchestration gives you visibility and control. Outbox gives you reliable event publishing. 2PC is for traditional distributed databases, not microservices. Choreography is for simple, linear flows where orchestration overhead isn't justified.

09

Interview Questions

Q: Why is 2PC not preferred in microservices?

A: 2PC is blocking: if the coordinator crashes after the prepare phase, all participants hold locks and wait indefinitely. It's slow (two round trips + lock holding). It doesn't scale (locks across services during the entire flow). It's a single point of failure. In microservices, services are independently deployed and scaled — 2PC couples them tightly. The Saga pattern provides eventual consistency without blocking, which is a better fit for microservices.

Q: Saga choreography vs orchestration: when to use each?

A: Choreography: services communicate via events, no central coordinator. Best for simple, linear flows (3-4 steps) in event-driven systems. Pros: decentralized, scalable. Cons: hard to debug, no single view of the saga. Orchestration: a central orchestrator controls the flow. Best for complex workflows with conditions, retries, and many steps. Pros: clear flow, easy to debug, centralized state. Cons: orchestrator is a point of control. Most production systems use orchestration for anything beyond trivial flows.

Q: What is the Outbox pattern and why is it important?

A: The Outbox pattern solves the dual-write problem: writing to a database AND publishing an event must both succeed or both fail. Solution: write the business data and the event to the same database in one transaction (the event goes to an 'outbox' table). A background worker reads the outbox and publishes events to Kafka. If the worker crashes, it retries — events are never lost. This is the standard way to publish events reliably in microservices.

1

Payment succeeds but inventory reservation fails

How do you handle this in a saga?

Answer: The saga triggers compensating actions in reverse order. The orchestrator (or the choreography event chain) tells the Payment service to refund the charge. The Order service cancels the order. Each compensation is idempotent — if the refund command is sent twice (retry), the Payment service checks if it already refunded and skips the duplicate. The customer sees: 'Sorry, items are out of stock. Your payment has been refunded.' The system is eventually consistent.

2

You need to create an order in the DB and notify the payment service

How do you ensure both happen reliably?

Answer: Outbox pattern. In one database transaction: INSERT the order AND INSERT an 'OrderCreated' event into the outbox table. A background relay reads unpublished events from the outbox and publishes them to Kafka. The Payment service consumes the event. If the relay crashes, it retries on restart — the event is still in the outbox. If Kafka is down, the relay retries until it succeeds. The event is published if and only if the order was created.

10

Pitfalls

🔒

Using 2PC in microservices

Applying Two-Phase Commit across independently deployed microservices. This couples all services to the coordinator's availability, introduces blocking, and doesn't scale. If the coordinator is down, the entire system halts.

Use the Saga pattern instead. Accept eventual consistency. Design compensating actions for each step. Use the Outbox pattern for reliable event publishing. 2PC is for tightly coupled distributed databases, not loosely coupled microservices.

↩️

Not implementing compensation logic

Building the happy path (create order → charge → reserve) but not the failure path. When inventory fails, the payment is never refunded. The customer is charged for items they can't receive.

For every forward action, define a compensating action. Create order ↔ Cancel order. Charge payment ↔ Refund payment. Reserve items ↔ Release reservation. Test the failure paths as thoroughly as the happy path. Compensation logic is not optional — it's half the saga.

Ignoring eventual consistency

Expecting immediate consistency across services after a saga step. A user places an order and immediately checks their order history — but the read model hasn't been updated yet. 'Where's my order?'

Design the UI for eventual consistency. Show 'Order processing...' immediately. Update the read model asynchronously. Use read-your-own-writes: after placing an order, route that user's reads to the write service briefly. Set expectations: the order confirmation page shows the order even before all services have processed it.
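
The read-your-own-writes idea can be sketched as a small routing rule: after a user writes, send that user's reads to the write side for a short window. The `ReadRouter` class, the 5-second default, and the target names are all assumptions made for this illustration.

```python
import time
from typing import Callable


class ReadRouter:
    """Hypothetical router that pins recent writers to the write service."""

    def __init__(self, pin_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.pin_seconds = pin_seconds
        self.clock = clock  # injectable so the window can be tested
        self._last_write: dict[str, float] = {}

    def record_write(self, user_id: str) -> None:
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id: str) -> str:
        written_at = self._last_write.get(user_id)
        if written_at is not None and self.clock() - written_at < self.pin_seconds:
            return "write-service"  # read model may still be stale for this user
        return "read-model"
```

The window length is a trade-off: it should exceed the typical read-model lag, while staying short enough that the write service is not burdened with ordinary read traffic.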

✌️

The dual-write problem

Writing to the database and then publishing an event as two separate operations. If the publish fails, the event is lost. If you publish first and the DB write fails, the event describes data that doesn't exist. This is the most common reliability bug in microservices.

Use the Outbox pattern. Write the business data and the event to the same database in one transaction. A background relay publishes events from the outbox. Alternatively, use Change Data Capture (Debezium) to stream database changes to Kafka automatically — no outbox table needed.