Case Study

Transactional Booking & Payment Platform

Payments • Webhooks • Concurrency

Overview

Backend for Patagonia Dreams, a tourism reservation platform running in production, with a multi-module, multi-tenant backoffice for partners and end customers. Payments go through Mercado Pago, Stripe, and Pix; verified webhooks are the single source of truth for "reservation paid," with HMAC validation and idempotency by event_id. Identity is handled by AWS Cognito with token verification and SECRET_HASH. Security measures: safe export (JSON instead of CSV to avoid formula injection), URL sanitization in emails, secrets in AWS Secrets Manager, and a CI/CD pipeline with SAST (Semgrep) and secret scanning (Gitleaks). Stack: Django, DRF, PostgreSQL, AWS (SES, Cognito, ECR/K8s), an external Panel, and payment gateways.

Architecture & Design

Transactional Flow Overview

Internal System

System Invariants

  • A payment intent cannot transition from failed to succeeded.
  • Reservation "paid" is set only after a verified webhook; frontend and redirect cannot set it.
  • Webhook events are processed idempotently by provider event_id.
  • Availability for a slot is updated under pessimistic lock (SELECT FOR UPDATE); no optimistic commit.
  • Idempotency keys are scoped per client and stored; duplicate key returns original response.
  • Payment and reservation state changes for a webhook occur in a single database transaction.
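
The first invariant can be enforced with an explicit transition table. This is a minimal sketch, not the platform's actual code: the status set (pending, succeeded, failed) and the names PaymentStatus, ALLOWED_TRANSITIONS, and transition are hypothetical, and treating succeeded as terminal is also an assumption.

```python
from enum import Enum


class PaymentStatus(str, Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Hypothetical transition table: terminal states map to an empty set,
# so failed -> succeeded is rejected by construction.
ALLOWED_TRANSITIONS = {
    PaymentStatus.PENDING: {PaymentStatus.SUCCEEDED, PaymentStatus.FAILED},
    PaymentStatus.SUCCEEDED: set(),
    PaymentStatus.FAILED: set(),
}


class InvalidTransition(Exception):
    """Raised when a status change would violate the invariant."""


def transition(current: PaymentStatus, target: PaymentStatus) -> PaymentStatus:
    # Re-applying the current status is a no-op, which keeps replays harmless.
    if target == current:
        return current
    if target not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Keeping the rule in one table means every writer (here, the webhook handler) goes through the same check instead of re-implementing it per provider.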

Scale & Constraints

Request volume
~2k reservations/month; webhook bursts up to ~50/min on peak.
Concurrency
Pessimistic lock on availability row per slot; single writer for payment state. No cross-slot locking.
External dependencies
Mercado Pago, Stripe, Pix; AWS Cognito, SES; external Panel (activities, rates, blocks). Webhooks are async; payment status only via webhook.
Failure modes
Provider timeout or webhook delay → reservation stays pending until webhook or manual reconciliation. Duplicate webhook → idempotent by event_id. Cognito/Panel down → degraded auth or catalog sync.
Data consistency
Single DB transaction for reservation + payment on webhook. Reservation "paid" only after webhook; frontend cannot set paid. Cognito ↔ Django user sync via get_or_create and ID token verification.

What was explicitly rejected

  • Frontend or redirect callback as source of "paid". Redirects and client state are unreliable; provider retries and multiple tabs would allow double-apply or missed updates.
  • Optimistic locking on availability. Conflict rate on hot slots would cause high retry and poor UX; pessimistic lock gave predictable behaviour at observed load.
  • Microservices per domain (payments, reservations, catalog). Operational and consistency cost (distributed transactions, eventual consistency) not justified for current scale; modular monolith with clear boundaries chosen instead.
  • CSV export for operations. Excel/CSV formula injection risk; replaced with JSON response and controlled data only.
  • Secrets or sensitive URLs in code or repo. All critical config (FRONTEND_URL, Cognito, Stripe, Panel, etc.) via env from AWS Secrets Manager.

What would break this system?

  • Single DB or replica failure: all reservations and payment state in one store; no automatic failover.
  • Webhook delivery stopped (provider or our endpoint): reservations stay pending indefinitely; no path to "paid".
  • Lock contention on hot slots: SELECT FOR UPDATE serializes; at higher concurrency wait times and timeouts grow.
  • Idempotency key table unbounded growth: cleanup fails or is delayed → table bloat and slower lookups.
  • Mercado Pago, Stripe, and Pix all degraded: no path to confirm payment; business stops.
  • Catalog sync provider wrong or down: stale inventory; overbooking if external availability is authoritative.
  • Cognito unavailable: no signup/login or token refresh; identity is single point of entry.
  • Secrets Manager or env misconfiguration: auth or payment integrations fail at runtime.

Deep dive

Identity & auth (trust boundaries)

AWS Cognito is the single entry point for identity: signup, email confirmation, login with user/password, token refresh, and OAuth Authorization Code callback for Hosted UI.

Bidirectional sync Cognito ↔ Django user: get_or_create by email, unique username generation on collision, and update of names and active status from the verified ID token (JWKS, issuer, audience, exp). For confidential clients, SECRET_HASH is sent on every Cognito call that requires it (sign_up, confirm_sign_up, authenticate, refresh_token), which avoids a class of production configuration errors.
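
For reference, Cognito defines SECRET_HASH as the Base64-encoded HMAC-SHA256 of the username concatenated with the app client ID, keyed by the app client secret. A minimal sketch using only the standard library; the function name cognito_secret_hash is hypothetical and the sample values in the usage note are placeholders.

```python
import base64
import hashlib
import hmac


def cognito_secret_hash(username: str, client_id: str, client_secret: str) -> str:
    # Message is username + client_id; the key is the app client secret.
    message = (username + client_id).encode("utf-8")
    digest = hmac.new(client_secret.encode("utf-8"), message, hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")
```

Usage: cognito_secret_hash("ana@example.com", "example-client-id", "example-secret") returns a 44-character Base64 string to pass as SECRET_HASH. The username must match the value sent in the same Cognito call, otherwise the service rejects the request.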

ID token is verified with JWKS (RS256, issuer, audience) before trusting any user data; without verification we do not create or update the local user.

Payment and transactional flows

Payment and booking confirmation flows: transactional emails (AWS SES) with booking data, passengers, activities, and secure links; templates parameterized only with controlled context (booking, activities).

Integration with external Panel (activities, rates, blocks) and mapping Panel activity ↔ local activity, with data export for operations and reporting.

Atomic transactions (transaction.atomic()) on reservation creation/update and mappings to keep consistency under failures or concurrency.

Security and trust boundaries

Injection vectors removed: CSV export replaced by JSON response to avoid Excel/CSV formula injection; URL validation in email templates (http_url filter: only http/https) to prevent XSS via javascript: in href.
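
The http_url filter's source is not shown here; the following is a sketch of how such a filter could work, assuming it returns an empty string for anything that is not an absolute http or https URL. In a Django project it would be registered as a template filter; that wiring is omitted so the function runs standalone.

```python
from urllib.parse import urlsplit


def http_url(value) -> str:
    """Return value unchanged if it is an absolute http(s) URL, else ''."""
    candidate = (value or "").strip()
    try:
        parts = urlsplit(candidate)
    except ValueError:
        # Malformed input (e.g. a broken IPv6 literal) is treated as unsafe.
        return ""
    # urlsplit lowercases the scheme, so "JAVASCRIPT:" is caught as well.
    if parts.scheme in ("http", "https") and parts.netloc:
        return candidate
    return ""
```

An allowlist of schemes is used rather than a blocklist of javascript:, so data:, vbscript:, and scheme-relative URLs are rejected without being enumerated.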

Secrets out of code: critical config (FRONTEND_URL, WHATSAPP_NUMBER, social URLs, Cognito, Stripe, Panel, etc.) via environment variables from AWS Secrets Manager; no sensitive values in repo.

CI/CD and security: pipeline with Semgrep (SAST), Gitleaks, pip-audit, Trivy (Docker image); no direct push to production branch; migration and dependency review before deploy.

Concurrency and correctness

Idempotency and uniqueness: get_or_create and "single record" logic (e.g. Layouts, mappings) to avoid duplicates and race conditions on writes.

Explicit transactions in flows that touch multiple models (reservation + user + notifications) to guarantee all-or-nothing and consistency on failure.

Integration with external APIs (Panel, Cognito) with timeouts and error handling so the process does not block and we do not trust malformed responses.

Idempotency in distributed payments

Payment providers and webhooks deliver at least once. Retries, network partitions, and client double-submits make duplicate events the norm. Idempotency is implemented at two distinct layers: client-initiated operations (e.g. reservation creation) and server-driven events (webhooks).

For client operations we use an idempotency key (supplied by the client or derived from a deterministic hash of intent), scoped per client. The scoped key is the sole lookup for the stored outcome: the first request executes and persists the result, and subsequent requests return the stored result without re-executing. The key design decision is to store the outcome (success/failure plus response payload or error code), not just "seen", which allows safe replay with correct semantics.
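
The "store the outcome, not just seen" idea can be sketched in a few lines. This is an in-memory illustration only: the names Outcome, IdempotencyStore, and run_once are hypothetical, and a production version would presumably persist outcomes in a database table with a unique constraint on (client, key) rather than hold a process-wide lock.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Outcome:
    ok: bool
    payload: dict


class IdempotencyStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._outcomes: Dict[Tuple[str, str], Outcome] = {}

    def run_once(self, client_id: str, key: str, operation: Callable[[], dict]) -> Outcome:
        scoped = (client_id, key)  # keys are scoped per client
        with self._lock:
            stored = self._outcomes.get(scoped)
            if stored is not None:
                return stored  # replay: return the original response
            try:
                outcome = Outcome(True, operation())
            except Exception as exc:
                # Failures are stored too, so a retry sees the same error.
                outcome = Outcome(False, {"error": type(exc).__name__})
            self._outcomes[scoped] = outcome
            return outcome
```

Calling run_once twice with the same client and key executes the operation once; the second call returns the stored Outcome, whether it was a success or a failure.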

For webhooks we key on event_id (or provider-side id). The same event_id may be delivered multiple times; we apply the state transition once and ignore duplicates. Critical: the idempotency check and the state update (e.g. mark reservation paid) live in the same transaction so we never double-apply under concurrency. Key expiry and cleanup policies prevent unbounded growth while retaining keys long enough to cover provider retry windows (e.g. 24–72h).
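
A sketch of the webhook path described above, with the deduplication insert and the state update in one transaction. Several things here are assumptions for illustration: sqlite3 stands in for PostgreSQL so the example runs standalone, the table and column names are hypothetical, and the signature is a plain hex HMAC-SHA256 of the raw body (each provider defines its own header format and signed payload).

```python
import hashlib
import hmac
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # The primary key on event_id is what makes duplicates detectable.
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE reservations (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.commit()


def signature_is_valid(secret: bytes, body: bytes, signature_hex: str) -> bool:
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


def handle_webhook(conn: sqlite3.Connection, secret: bytes, body: bytes, signature_hex: str) -> str:
    if not signature_is_valid(secret, body, signature_hex):
        return "rejected"
    event = json.loads(body)
    with conn:  # one transaction: commit on success, rollback on exception
        try:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
        except sqlite3.IntegrityError:
            return "duplicate"  # already applied; acknowledge and do nothing
        conn.execute(
            "UPDATE reservations SET status = 'paid' WHERE id = ? AND status = 'pending'",
            (event["reservation_id"],),
        )
    return "applied"
```

Because the event row and the reservation update commit together, a crash between the two cannot leave an event marked as processed without its effect, and a concurrent duplicate fails on the primary key instead of applying twice.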

Atomic state transitions and race conditions

Double-booking occurs when two concurrent requests both read "available" and then both commit a booking. The fix is a single writer and serialisation at the consistency boundary. We use SELECT FOR UPDATE on the availability row (or the aggregate that owns it) inside the same transaction that creates the reservation. The second request blocks until the first commits or rolls back; it then sees updated state and either succeeds on remaining capacity or fails consistently.
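
The locking pattern can be sketched against a plain DB-API connection. Assumptions: conn is a PostgreSQL connection with autocommit off (for example from psycopg), and the availability and reservations tables, their columns, and the names SoldOut and reserve_seats are hypothetical. In Django the same effect is usually obtained with select_for_update() inside transaction.atomic().

```python
class SoldOut(Exception):
    """Raised when the slot has fewer seats left than requested."""


def reserve_seats(conn, slot_id: int, customer_id: int, seats: int) -> None:
    cur = conn.cursor()
    try:
        # Row lock: a concurrent caller blocks here until we commit or roll back.
        cur.execute(
            "SELECT remaining FROM availability WHERE slot_id = %s FOR UPDATE",
            (slot_id,),
        )
        row = cur.fetchone()
        if row is None or row[0] < seats:
            raise SoldOut(f"slot {slot_id} cannot fit {seats} seat(s)")
        cur.execute(
            "UPDATE availability SET remaining = remaining - %s WHERE slot_id = %s",
            (seats, slot_id),
        )
        cur.execute(
            "INSERT INTO reservations (slot_id, customer_id, seats, status) "
            "VALUES (%s, %s, %s, 'pending')",
            (slot_id, customer_id, seats),
        )
        conn.commit()  # releases the lock; waiters now read the new count
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

The check, the decrement, and the reservation insert share one transaction, so the second of two concurrent requests reads the remaining count only after the first has committed.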

When the transition spans two records (e.g. payment record and reservation) and both tables live in the same database, we keep them in one DB transaction: commit creates the payment row and updates the reservation in one atomic step. When payment is external (provider webhook), no single distributed transaction is available. We treat the webhook as the source of truth for "paid" and update our reservation in one local transaction keyed by the idempotent event_id; the only writer for that transition is the webhook handler.

We avoid saga-style compensating transactions for the core path: they add complexity and new failure modes. Where we must coordinate across services, we use an outbox or single write that triggers downstream work, with idempotent consumers so duplicate events do not double-apply.