
Case Study

Idempotent Payment Orchestrator

Backend Architecture • Transactional Systems

Overview

Designed and implemented a retry-safe payment processing pipeline with strict idempotency guarantees under concurrent transaction submissions. Idempotency keys at request boundary; outbox for provider calls; webhook reconciliation by event_id. Guarantee: no double charge under client retries, duplicate webhooks, or network failure between DB commit and provider call.

Client
   ↓
API (Django)
   ↓
PostgreSQL (transactions + idempotency)
   ↓
Outbox Table
   ↓
Worker (Celery)
   ↓
Payment Provider

Architecture & Design

Transactional Flow Overview

Internal System

System Invariants

  • A payment intent cannot transition from failed to succeeded.
  • Reservation "paid" is set only after a verified webhook; frontend and redirect cannot set it.
  • Webhook events are processed idempotently by provider event_id.
  • Availability for a slot is updated under pessimistic lock (SELECT FOR UPDATE); no optimistic commit.
  • Idempotency keys are scoped per client and stored; duplicate key returns original response.
  • Payment and reservation state changes for a webhook occur in a single database transaction.

Scale & Constraints

Request volume: Client and provider retries; requests and webhooks can arrive duplicated or out of order.
Concurrency: Single writer per idempotency key; outbox for provider calls. No double charge under retries.
External dependencies: Payment provider API; webhooks. Network failures between DB commit and provider call are possible.
Failure modes: Provider timeout or unreachable after commit → outbox retry. Duplicate webhook → idempotent by event_id. Client retry → same key returns stored outcome.
Data consistency: Payment state and outbox live in the same DB and are committed together, before any provider call. No double charge; the idempotency key is the sole source of the outcome for a request.

What was explicitly rejected

  • Simple request-based processing without idempotency keys. Retries and duplicate submissions would cause double charge; key at business layer is required.
  • Handling retries only at HTTP layer. Application state can still double-apply; idempotency must be enforced at orchestration layer with a stable key.
  • Relying entirely on provider guarantees. Provider semantics vary and may not guarantee exactly-once; we own the no-double-charge guarantee.
  • Processing side effects inside request lifecycle. If process dies after DB commit but before provider call, state is inconsistent; outbox decouples and allows retry without re-executing request.

What would break this system?

  • Outbox worker stopped: payments committed in DB never reach provider; state stuck, money never charged.
  • Provider accepts charge but never sends webhook: we may never mark succeeded; reconciliation depends on manual or batch check.
  • Idempotency key reused for different intents: wrong outcome returned; key must be per intent.
  • Provider eventually consistent: we mark paid on webhook; provider may still show pending; read-your-writes violation for downstream.
  • Worker retries outbox row without provider idempotency: double charge if provider does not deduplicate by our key.

Deep dive

Idempotency Strategy

Every payment initiation request must carry an idempotency key (client-supplied or derived from intent). The key is the sole lookup for the stored outcome. First request with a given key creates the payment row and runs the state machine; subsequent requests with the same key return the stored outcome without re-executing. We store outcome (success/failure plus response or error code), not just "seen", so replay returns the same result.

A unique constraint on (client_id, idempotency_key) in the database enforces one row per key. Concurrent requests with the same key: one inserts and proceeds; others hit the constraint and either retry the read or treat as duplicate. No application-level lock required; the constraint is the serialisation point.
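As a minimal sketch of the constraint acting as the serialisation point, the snippet below uses an in-memory SQLite database as a stand-in for PostgreSQL. The table layout, the `amount_cents` column, and the function name `create_or_replay` are illustrative assumptions, not the project's actual schema.

```python
import sqlite3


def open_db():
    # Assumption: in-memory SQLite stands in for PostgreSQL in this sketch.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE payment (
            id INTEGER PRIMARY KEY,
            client_id TEXT NOT NULL,
            idempotency_key TEXT NOT NULL,
            amount_cents INTEGER NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending',
            outcome TEXT,
            UNIQUE (client_id, idempotency_key)
        )
        """
    )
    return conn


def create_or_replay(conn, client_id, key, amount_cents):
    """Insert the payment row, or return the stored row if the key was seen.

    Returns (row, created). The unique constraint decides the winner, so no
    application-level lock is taken.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO payment (client_id, idempotency_key, amount_cents) "
                "VALUES (?, ?, ?)",
                (client_id, key, amount_cents),
            )
        created = True
    except sqlite3.IntegrityError:
        # Another request with the same (client_id, key) already inserted.
        created = False
    row = conn.execute(
        "SELECT id, status, outcome FROM payment "
        "WHERE client_id = ? AND idempotency_key = ?",
        (client_id, key),
    ).fetchone()
    return {"id": row[0], "status": row[1], "outcome": row[2]}, created
```

A caller would run the state machine only when `created` is true; otherwise it returns the stored status and outcome unchanged.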

Payment state is a status-based state machine (e.g. pending → charge_requested → provider_called → succeeded, or failed). Transitions are deterministic and stored in one transaction. Same key always yields same terminal state; we never transition from failed to succeeded or create a second charge.
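The transition rules can be sketched as a small pure function. The state names come from the description above; the specific rule set (any non-terminal state may fail, forward skips are allowed, terminal states are final) and the names `transition` and `InvalidTransition` are assumptions made for illustration.

```python
# Happy-path order taken from the text; "failed" is the other terminal state.
ORDER = ["pending", "charge_requested", "provider_called", "succeeded"]
TERMINAL = {"succeeded", "failed"}


class InvalidTransition(Exception):
    """Raised when a transition would move backwards or leave a terminal state."""


def transition(current, target):
    """Return the new state, or raise InvalidTransition.

    Assumptions in this sketch:
    - re-applying the current state is a no-op (safe replay);
    - terminal states never change (so failed -> succeeded is rejected);
    - a non-terminal state may move to "failed" or to any later state.
    """
    if target == current:
        return current
    if current in TERMINAL:
        raise InvalidTransition(f"{current} is terminal; cannot move to {target}")
    if target == "failed":
        return target
    if target in ORDER and ORDER.index(target) > ORDER.index(current):
        return target
    raise InvalidTransition(f"cannot move from {current} to {target}")
```

Keeping the rules in one function means the request handler, the worker, and the webhook handler all reject the same illegal moves.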

Transaction Boundaries

Each state transition is one atomic database transaction. We insert or update the payment row and, when we need to call the provider, insert an outbox row in the same transaction. Commit happens before any HTTP call to the provider. If we committed and then called the provider in the same process, a crash after commit but before the call would leave our DB updated but the provider never called; the outbox row ensures a background worker will perform the call later.

The provider call must happen outside the database transaction because the provider is not a participant in it: we cannot roll back a provider charge if our commit fails. So the request never commits and then calls the provider itself. Instead it commits (state + outbox row) and returns; the worker calls the provider and updates state in a separate transaction.
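A minimal sketch of the "state + outbox row in one commit" step follows, again with in-memory SQLite standing in for PostgreSQL. The outbox columns follow the ones listed in this case study; the `payment` table shape and the function name `request_charge` are assumptions.

```python
import json
import sqlite3


def open_db():
    # Assumption: in-memory SQLite stands in for PostgreSQL in this sketch.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE payment (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY,
            payment_id INTEGER NOT NULL REFERENCES payment(id),
            payload TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending',
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            processed_at TEXT
        );
        INSERT INTO payment (id, status) VALUES (1, 'pending');
        """
    )
    return conn


def request_charge(conn, payment_id, payload):
    """Move the payment to charge_requested and enqueue the provider call.

    Both writes share one transaction: either both commit or neither does.
    No HTTP call to the provider happens here.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE payment SET status = 'charge_requested' "
            "WHERE id = ? AND status = 'pending'",
            (payment_id,),
        )
        if cur.rowcount != 1:
            raise ValueError("payment is not in 'pending'; nothing enqueued")
        conn.execute(
            "INSERT INTO outbox (payment_id, payload) VALUES (?, ?)",
            (payment_id, json.dumps(payload)),
        )
```

If anything fails between the two statements (here, for example, a payload that cannot be serialised), the status update is rolled back along with the outbox insert, so a payment in `charge_requested` always has a matching outbox row.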

Double execution is prevented by the idempotency key at the request boundary (same key → same stored outcome) and by the unique constraint (one row per key). The worker normally dispatches each outbox row once, but delivery is at-least-once: if it retries, the provider call uses the same idempotency key so the provider does not double-charge.

Outbox Pattern Implementation

We use a dedicated outbox table: columns include id, payment_id, payload, status (pending/processed/failed), created_at, processed_at. When we transition payment to "charge requested", we insert a row into the outbox in the same transaction. No other side effects run in that request.

A background worker polls the outbox for pending rows (or is notified by a queue). It loads the row, calls the payment provider with the payload and idempotency key, and on success marks the row processed and updates the payment state in one transaction. On provider failure or timeout it leaves the row pending and retries with backoff.

We do not guarantee that the provider is called in the same second as the commit; we guarantee that every committed outbox row is eventually processed. The worker retries until the provider accepts or we mark failed after a threshold. Local state (payment + outbox) is consistent after each transaction; provider state catches up when the worker succeeds. That is eventual consistency between our DB and the provider.
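The retry behaviour described above can be sketched as a single polling pass over in-memory rows. This is not the Celery worker itself; `OutboxRow`, `run_once`, `backoff_seconds`, the `attempts` and `next_attempt_at` fields, and the exponential schedule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class OutboxRow:
    id: int
    payment_id: int
    idempotency_key: str
    status: str = "pending"       # pending / processed / failed
    attempts: int = 0             # hypothetical bookkeeping field
    next_attempt_at: float = 0.0  # hypothetical bookkeeping field


class ProviderUnavailable(Exception):
    """Timeout or transport error: the call may be retried."""


def backoff_seconds(attempts, base=2.0, cap=300.0):
    # Exponential backoff: 2s, 4s, 8s, ... capped at `cap`.
    return min(cap, base * (2 ** (attempts - 1)))


def run_once(rows, call_provider, now, max_attempts=5):
    """One polling pass. Returns the ids of rows processed in this pass."""
    done = []
    for row in rows:
        if row.status != "pending" or row.next_attempt_at > now:
            continue  # already handled, or not due yet
        row.attempts += 1
        try:
            # The same idempotency key is sent on every attempt.
            call_provider(row.idempotency_key, row.payment_id)
        except ProviderUnavailable:
            if row.attempts >= max_attempts:
                row.status = "failed"  # threshold reached
            else:
                row.next_attempt_at = now + backoff_seconds(row.attempts)
            continue
        row.status = "processed"
        done.append(row.id)
    return done
```

In a database-backed version, marking the row processed and updating the payment state would share one transaction, as described above.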

Webhook Reconciliation

We persist every webhook event in a table keyed by provider event_id (or equivalent). Before applying any transition we check whether that event_id is already stored; if so we skip (deduplicate). If not we apply the transition (e.g. payment succeeded) and store the event_id in the same transaction. Duplicate webhooks for the same event_id are no-ops.

Deduplication is by provider event_id only. We do not key by our internal id; the provider can send the same event multiple times. First occurrence wins; later ones are ignored. Order of arrival does not change the outcome because the state machine is deterministic and we only move forward (e.g. pending → succeeded; we never overwrite succeeded with failed).

We reconcile webhook-derived state with the local state machine. If the worker already updated state from a successful provider call, the webhook may be redundant; we still store the event_id and treat as idempotent duplicate. If the webhook arrives before the worker completes, we update from the webhook and when the worker runs it sees the payment already succeeded (or we mark outbox as reconciled). Reconciliation job can compare our state to provider state for aged pending items and alert or retry.
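A sketch of the event_id deduplication, with the event insert and the state change in one transaction, might look like the following. In-memory SQLite stands in for PostgreSQL, and the table names, the `handle_webhook` function, and its return values are hypothetical.

```python
import sqlite3


def open_db():
    # Assumption: in-memory SQLite stands in for PostgreSQL in this sketch.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE payment (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE webhook_event (
            event_id TEXT PRIMARY KEY,      -- provider event_id
            payment_id INTEGER NOT NULL
        );
        INSERT INTO payment (id, status) VALUES (1, 'provider_called');
        """
    )
    return conn


def handle_webhook(conn, event_id, payment_id, new_status):
    """Apply a webhook at most once per provider event_id.

    Returns "applied", "redundant" (event stored, state already terminal),
    or "duplicate" (event_id seen before, nothing written).
    """
    try:
        with conn:  # event row and state change commit together
            conn.execute(
                "INSERT INTO webhook_event (event_id, payment_id) VALUES (?, ?)",
                (event_id, payment_id),
            )
            # Only move forward: a terminal state is never overwritten.
            cur = conn.execute(
                "UPDATE payment SET status = ? "
                "WHERE id = ? AND status NOT IN ('succeeded', 'failed')",
                (new_status, payment_id),
            )
    except sqlite3.IntegrityError:
        return "duplicate"
    return "applied" if cur.rowcount == 1 else "redundant"
```

The primary key on `event_id` makes the duplicate check race-free: two concurrent deliveries of the same event cannot both insert.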

Failure Mode Analysis

Crash after DB commit but before provider call: we never call the provider in the request; we only write the outbox row. After restart the worker picks up the pending row and dispatches. The provider is called once when the worker runs. No double charge because only the worker performs the call and it marks the row processed after success.

Provider returns success but client times out: the client may retry with the same idempotency key. We return the stored outcome (success) without re-executing. The charge already happened; the retry is a read. If the client never got the response we have already persisted success and may have written the outbox row; the worker will not send a second charge because we do not insert a second outbox row for the same payment.

Worker crash during dispatch: the worker calls the provider then must mark the outbox row processed in the same or a follow-up transaction. If the worker crashes after the provider call but before marking processed, on restart it will retry the same row. The provider receives a second call with the same idempotency key; the provider must deduplicate (return success without double-charging). So we rely on provider idempotency for this case. Alternatively we mark "dispatching" before the call and only set "processed" after; retries skip rows already in "dispatching" for longer than a timeout.
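The worker-crash case can be illustrated with a fake provider that deduplicates by idempotency key. `FakeProvider`, `dispatch`, and the `crash_before_mark` flag are hypothetical test scaffolding; a real provider's deduplication semantics would need to be confirmed against its documentation.

```python
class FakeProvider:
    """Stand-in provider that deduplicates charges by idempotency key."""

    def __init__(self):
        self.charges = {}  # idempotency_key -> charge record

    def charge(self, idempotency_key, amount_cents):
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "charge_id": f"ch_{len(self.charges) + 1}",
                "amount_cents": amount_cents,
            }
        # A repeated key returns the original charge instead of a new one.
        return self.charges[idempotency_key]


def dispatch(row, provider, crash_before_mark=False):
    """Call the provider, then mark the outbox row processed.

    `crash_before_mark` simulates the worker dying between the two steps.
    """
    result = provider.charge(row["idempotency_key"], row["amount_cents"])
    if crash_before_mark:
        raise RuntimeError("simulated worker crash after provider call")
    row["status"] = "processed"
    return result


provider = FakeProvider()
row = {"idempotency_key": "client-a:key-1", "amount_cents": 5000, "status": "pending"}

try:
    dispatch(row, provider, crash_before_mark=True)  # charge made, row still pending
except RuntimeError:
    pass

retry_result = dispatch(row, provider)  # restart: same row, same key
```

After the retry the row is processed and the fake provider holds exactly one charge, which is the behaviour this failure mode depends on from the real provider.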