Case Study

Municipal Unified Identity Platform

Identity • Trust Boundaries • RBAC

Overview

Centralized authentication gateway for a municipality: citizens authenticate once and access multiple government services with a single token. Identity is validated against national registries (Mi Argentina, RENAPER, AFIP) on every login; the gateway is the only component that calls those APIs and the only issuer of session tokens. Legacy systems consume signed tokens and enforce RBAC; they do not re-authenticate. Tokens carry no PII, and the gateway fails safe when national APIs are unavailable. Auditing and RBAC apply at both the gateway and the service layer.

Architecture & Design

Diagrams (not shown): Identity Trust Boundary Flow; Internal System.

System Invariants

· Tokens must not contain PII; only claims required for authorization (sub, roles, exp).
· "Verified" is never issued when verification against national APIs did not succeed.
· Only the gateway calls national APIs and issues tokens; services validate tokens only.
· Session tokens are short-lived; identity is re-validated on every login.
· RBAC is enforced at gateway (route) and at service (resource); no bypass.
· All authentication and token issuance events are logged for audit.
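
The two-layer RBAC invariant can be illustrated with a minimal sketch. Everything named here (the `ROUTE_ROLES` table, the role names, `gateway_allows`, `service_allows`) is a hypothetical illustration, not the platform's actual policy model. The point is only that the route check and the resource check are independent, and that both deny by default.

```python
from dataclasses import dataclass

# Hypothetical route-level policy: which roles may reach which route prefix.
ROUTE_ROLES = {
    "/permits": {"citizen", "clerk"},
    "/admin": {"admin"},
}


@dataclass(frozen=True)
class Claims:
    sub: str          # opaque subject id, not a national ID number
    roles: frozenset  # role names carried in the token
    exp: int          # expiry, seconds since epoch


def gateway_allows(claims: Claims, path: str) -> bool:
    """Route-level check at the gateway: deny unless a role matches."""
    for prefix, allowed in ROUTE_ROLES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return bool(claims.roles & allowed)
    return False  # unknown routes are denied by default


def service_allows(claims: Claims, resource_owner: str, action: str) -> bool:
    """Resource-level check inside a service, independent of the gateway."""
    if "admin" in claims.roles:
        return True
    if "clerk" in claims.roles:
        return action == "read"  # assumed: clerks may read, not modify
    if "citizen" in claims.roles:
        return claims.sub == resource_owner  # citizens see only their own data
    return False
```

A request passes only if both functions return True; a service that skips `service_allows` would let a citizen with a valid route reach another citizen's resource.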

Scale & Constraints

Request volume: ~45k logins/month; token validation on every request to downstream services.
Concurrency: Gateway is single writer for tokens; services are read-only validators. No distributed lock; stateless validation.
External dependencies: Mi Argentina, RENAPER, AFIP. Login depends on at least one being available; degraded mode (unverified session or reject) when all are down.
Failure modes: National APIs down or slow → degraded mode or login failure; no "verified" issued without verification. Token validation failure → 401; no fallback to legacy auth.
Data consistency: Session and verification state only in gateway; tokens are signed assertions. Services do not persist identity state; they validate and apply RBAC per request.
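
The degraded-mode rule above can be sketched as a small decision function. This is an illustration under assumptions: the registry clients are modeled as plain callables, `RegistryUnavailable` is a hypothetical exception, and the policy that a negative answer from a reachable registry leads to rejection is an assumption, not something the case study states.

```python
from enum import Enum
from typing import Callable, Iterable


class Outcome(Enum):
    VERIFIED = "verified"
    UNVERIFIED = "unverified"
    REJECTED = "rejected"


class RegistryUnavailable(Exception):
    """Raised by a registry client on timeout or outage (hypothetical)."""


def decide_login(
    citizen_ref: str,
    registries: Iterable[Callable[[str], bool]],
    allow_unverified: bool,
) -> Outcome:
    """Return VERIFIED only if some registry positively confirmed identity."""
    reachable = False
    for check in registries:
        try:
            confirmed = check(citizen_ref)
        except RegistryUnavailable:
            continue  # try the next registry
        reachable = True
        if confirmed:
            return Outcome.VERIFIED
    if reachable:
        # Assumed policy: a registry answered and did not confirm identity.
        return Outcome.REJECTED
    # All registries down: degrade explicitly, never to VERIFIED.
    return Outcome.UNVERIFIED if allow_unverified else Outcome.REJECTED
```

No code path returns `VERIFIED` without a positive registry answer, which is the invariant that matters; whether the all-down case yields an unverified session or a rejection is a per-deployment choice (`allow_unverified`).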

What was explicitly rejected

Each legacy system calling national APIs and issuing its own tokens. Would duplicate integration, PII exposure, and failure modes; single gateway gives one trust boundary and one place to fail safe.
PII or raw registry data in tokens. Blast radius and compliance; tokens are minimal claims (sub, roles, exp) so compromise of a service does not leak registry data.
Long-lived tokens with no re-validation. Verification must reflect current state; every login re-validates against national APIs so "verified" cannot become stale.

What would break this system?

• Gateway down: no one logs in; single point of failure for all services.
• RENAPER, AFIP, Mi Argentina all unavailable: only unverified sessions or login failure; no degradation that preserves "verified".
• Token signing key compromise: all tokens forgeable until rotation; services must reject the old key, and all sessions must be invalidated.
• DB holding session/audit state lost: session revocation and audit trail gap; no point-in-time recovery of who had access.
• Sudden 10x login spike: national APIs and gateway become the bottleneck; external dependencies do not scale with us.
• Legacy service skips RBAC or misvalidates token: authorization bypass; boundary is only as strong as the weakest consumer.

Deep dive

Token design and trust boundaries

The gateway is the only component that calls identity providers (national APIs, etc.) and the only issuer of session tokens. Downstream services validate tokens and enforce RBAC; they never re-authenticate. That defines a clear trust boundary: everything behind the gateway trusts the gateway's issuance and treats the token as the authority for identity and claims.

Tokens carry minimal claims: identity id, roles, scope, expiry. No PII, no raw registry data. That limits blast radius on token leak and keeps compliance boundaries clear (PII stays in the system that owns it). We use signed tokens (e.g. JWT with HMAC or asymmetric signing); validators verify signature and expiry and reject anything else. Opaque tokens with a server-side lookup are an alternative when revocation must be immediate and global.
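
As a rough sketch of "minimal claims, signed, validate signature and expiry, reject anything else", the following uses only the Python standard library. It is a simplified HMAC-signed format for illustration, not a JWT and not the platform's actual token format; `ALLOWED_CLAIMS`, `issue_token`, and `validate_token` are hypothetical names. A real deployment would more likely use a vetted JWT library.

```python
import base64
import hashlib
import hmac
import json
import time

# Assumed allowlist, mirroring the claims named in the text.
ALLOWED_CLAIMS = {"sub", "roles", "scope", "exp"}


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(claims: dict, key: bytes) -> str:
    """Gateway side: refuse to sign anything outside the claim allowlist."""
    extra = set(claims) - ALLOWED_CLAIMS
    if extra:
        raise ValueError(f"claims not allowed in token: {sorted(extra)}")
    payload = _b64(json.dumps(claims, sort_keys=True).encode("utf-8"))
    sig = hmac.new(key, payload.encode("ascii"), hashlib.sha256).digest()
    return payload + "." + _b64(sig)


def validate_token(token: str, key: bytes, now=None):
    """Service side: return claims if signature and expiry hold, else None."""
    try:
        payload, sig = token.split(".")
        expected = hmac.new(key, payload.encode("ascii"), hashlib.sha256).digest()
        # Constant-time comparison; check the signature before parsing.
        if not hmac.compare_digest(_unb64(sig), expected):
            return None
        claims = json.loads(_unb64(payload))
        exp = claims["exp"]
    except (ValueError, KeyError, TypeError):
        return None  # malformed token: reject, never guess
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or exp <= current:
        return None
    return claims
```

With a shared HMAC key, every validating service can also forge tokens; asymmetric signing keeps the private key in the gateway only, which fits the single-issuer boundary more tightly.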

Revocation is handled at the gateway (session invalidation, logout). Without a shared revocation store, downstream services cannot see a logout immediately; when strict "logout everywhere" is required, they rely on short token lifetimes or periodic re-validation to bound how long a revoked token stays usable.

Audit logging strategy

We log state-changing actions with who (actor id or service), what (action type, resource id), when (timestamp), and enough context to reproduce (e.g. idempotency key, event_id, id of created/updated entity). Logs are append-only and immutable; no in-place edits. That supports compliance and post-incident analysis.

Structured fields (JSON or key-value) allow querying by correlation_id, request_id, or user_id. Correlation IDs are propagated across service boundaries so a single payment or login can be traced from gateway through to DB write. Retention is policy-driven: short for noisy debug logs, longer for audit and payment-related events.

Sensitive data is not logged in plain text; we log identifiers and event types, not full PII or card data. Audit logs are written synchronously in the critical path so we do not lose events on crash; we keep the payload small and the write fast (e.g. to a dedicated table or log stream).
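
A minimal sketch of such an audit record follows, assuming a JSON-lines stream as the sink. The field names beyond those mentioned in the text, the `FORBIDDEN_FIELDS` examples, and both function names are hypothetical; the real schema is not described in this case study.

```python
import json
import time
import uuid

# Hypothetical examples of keys that must never appear in an audit record.
FORBIDDEN_FIELDS = {"name", "dni", "cuil", "address", "email"}


def audit_record(actor, action, resource_id, correlation_id, outcome, now=None):
    """Build a small, structured record: who, what, when, and trace context."""
    return {
        "event_id": str(uuid.uuid4()),
        "ts": time.time() if now is None else now,
        "actor": actor,                  # opaque id, not PII
        "action": action,                # e.g. "token.issued"
        "resource_id": resource_id,
        "correlation_id": correlation_id,
        "outcome": outcome,
    }


def write_audit(stream, record: dict) -> None:
    """Append one JSON line and flush before returning (synchronous write)."""
    leaked = FORBIDDEN_FIELDS & set(record)
    if leaked:
        raise ValueError(f"PII fields not allowed in audit log: {sorted(leaked)}")
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    stream.flush()
```

The writer only ever appends; immutability of earlier lines has to come from the storage layer (for example an append-only table or log stream), which this sketch does not model.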

Observability and failure detection

Health checks are split: liveness (process up) vs readiness (dependencies acceptable). A service that cannot reach the DB or an identity provider should fail readiness so the orchestrator does not send traffic until it recovers. We avoid marking healthy when we cannot fulfill requests.
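
The liveness/readiness split can be sketched as two handlers that return an HTTP status code and a body. The dependency names and the callable-based check interface are assumptions for illustration; how the handlers are wired into a web framework or orchestrator probe is left out.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, dict]:
    """Process is up; says nothing about dependencies."""
    return 200, {"status": "alive"}


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Ready only if every dependency check passes; errors count as failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a check that raises is an unhealthy dependency
    ready = all(results.values())
    status = "ready" if ready else "not_ready"
    return (200 if ready else 503), {"status": status, "dependencies": results}
```

For example, `readiness({"db": ping_db, "identity_provider": ping_idp})` with hypothetical `ping_db` and `ping_idp` functions returns 503 as soon as either check fails or raises, while `liveness()` keeps returning 200 so the process is not restarted needlessly.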

We instrument payment and identity flows with metrics: latency (p50/p99), error rate by outcome (e.g. success, idempotent duplicate, validation failure), and idempotency hit rate. Alerts fire on elevated error rate, dependency failures (e.g. national API down), and payment webhook processing failures. Dashboards show success vs duplicate vs failure so we can distinguish retries from real regressions.

Distributed tracing (trace_id across services) ties a request from API through queue and DB. When a payment or login fails, we can follow the same trace_id in logs and traces. Failure detection covers not only "service down" but also "succeeding with degraded semantics": for example, we alert when we cannot verify identity and are serving unverified sessions, so the decision to degrade is explicit and visible.
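
Alerting on "succeeding with degraded semantics" could look roughly like the counter below. The class name, the outcome labels, and the threshold values are hypothetical; a real setup would more likely express this as a rule in the metrics system than as in-process code.

```python
from collections import Counter


class LoginOutcomes:
    """Counts login outcomes and flags when too many sessions are unverified."""

    def __init__(self, unverified_threshold: float = 0.05, min_samples: int = 20):
        self.counts = Counter()
        self.unverified_threshold = unverified_threshold  # assumed value
        self.min_samples = min_samples                    # avoid alerting on noise

    def record(self, outcome: str) -> None:
        self.counts[outcome] += 1

    def unverified_ratio(self) -> float:
        total = sum(self.counts.values())
        return self.counts["unverified"] / total if total else 0.0

    def should_alert(self) -> bool:
        total = sum(self.counts.values())
        return (
            total >= self.min_samples
            and self.unverified_ratio() > self.unverified_threshold
        )
```

Logins still succeed in this state, so an error-rate alert alone would stay silent; tracking the unverified share separately is what makes the degradation visible.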