Event Delivery Guarantees

TL;DR

Event delivery is designed to keep broker health and delivery retries off the request hot path. The gateway synchronously persists events to Redis Streams before request completion, while Kafka publishing is handled asynchronously by background workers. This provides at-least-once delivery with bounded request latency, explicitly prioritizing request isolation and reduced blast radius over end-to-end durability across all storage layers. Redis is treated as the primary availability and durability boundary, with PostgreSQL as a fallback under Redis failure.

Event Delivery Invariant

An event must never be acknowledged as delivered unless it has been durably recorded in at least one persistence layer. Duplicate delivery is acceptable; silent event loss is not. When delivery state is uncertain, the system favors retry over drop.

Problem

The gateway emits high-volume operational events (usage logs, billing signals, audit records) directly on the request hot path.

Event delivery must satisfy three competing constraints:

  • High throughput under concurrent request load
  • Minimal impact on request latency
  • No silent event loss

Synchronous publishing to Kafka introduces unpredictable latency and couples request handling to broker health. Asynchronous publishing improves latency but risks data loss on crashes or transient failures.

Why Naïve Approaches Fail

Direct synchronous Kafka publishing adds disk-backed broker latency to every request and amplifies tail latency during leader elections or ISR shrink events.

Fire-and-forget async publishing avoids latency but loses events on process crashes, OOMs, or transient network failures, with no recovery mechanism.

For a gateway-as-a-service, pushing the latency-versus-consistency tradeoff onto the request path widens the blast radius and turns broker incidents into user-visible outages.

Solution

Event delivery is decoupled from Kafka using a multi-tier buffering and failover pipeline.

  • Redis Streams act as the primary ingestion layer
  • Kafka remains the final delivery target
  • PostgreSQL provides a durable fallback under Redis failure

The gateway synchronously publishes events only to Redis. Kafka publishing is handled asynchronously by background workers, removing broker health from the request critical path.

Architecture Overview

Requests complete after a successful Redis append. Background workers handle Kafka delivery, retries, and recovery independently.

  • Redis Streams provide fast, in-memory buffering
  • Consumer groups ensure distributed delivery
  • Unacknowledged messages are reclaimed automatically
  • PostgreSQL absorbs writes during Redis outages

Core Design Principles

  • Reduced blast radius: Kafka failures do not affect request latency
  • Persist before acknowledge: Events are stored before request completion
  • Bounded coordination: No global locks on the hot path
  • Idempotency: Duplicate delivery is tolerated; silent loss is not

Fast Path: Redis Streams

Events are synchronously appended to Redis Streams using atomic operations. Redis provides predictable latency and high throughput under contention.

Each event is deduplicated before insertion and appended to a stream that a consumer group later drains for asynchronous Kafka delivery.
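A minimal sketch of the fast path, assuming the go-redis v9 client; the stream name, dedup key prefix, trim limit, and marker TTL are illustrative, and the SETNX marker is just one possible dedup scheme:

```go
package events

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Fast-path append: dedup marker plus XADD, both O(1) in Redis.
// The request completes as soon as XAdd returns.
func appendEvent(ctx context.Context, rdb *redis.Client, eventID string, payload []byte) error {
	// Deduplicate before insertion: a short-lived marker keyed by the
	// deterministic event ID. If it already exists, a retried request
	// has already buffered this event.
	ok, err := rdb.SetNX(ctx, "events:seen:"+eventID, 1, 15*time.Minute).Result()
	if err != nil {
		return err // Redis unhealthy: caller falls back to PostgreSQL
	}
	if !ok {
		return nil // duplicate: already buffered, safe to acknowledge
	}
	// Synchronous append to the stream the consumer group drains.
	return rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: "events:stream",
		MaxLen: 1_000_000, // bound memory with approximate trimming
		Approx: true,
		Values: map[string]interface{}{"id": eventID, "payload": payload},
	}).Err()
}
```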

Background Processing Model

Background workers consume events from Redis Streams and publish them synchronously to Kafka.

Messages are acknowledged only after successful Kafka delivery. Retryable errors leave messages pending, allowing automatic retries or reclamation by other replicas.
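A sketch of one worker replica under these assumptions: go-redis v9 for the stream side, segmentio/kafka-go for publishing; the group name, stream names, and batch sizes are illustrative:

```go
package events

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
	"github.com/segmentio/kafka-go"
)

// One delivery worker: read a batch as part of the consumer group,
// publish each event to Kafka, and XACK only on success. Failed
// messages stay pending and are retried or reclaimed.
func runWorker(ctx context.Context, rdb *redis.Client, w *kafka.Writer, consumer string) error {
	for {
		// ">" asks for entries never delivered to this group before.
		streams, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
			Group:    "kafka-delivery",
			Consumer: consumer,
			Streams:  []string{"events:stream", ">"},
			Count:    100,
			Block:    5 * time.Second,
		}).Result()
		if err == redis.Nil {
			continue // block window elapsed with no new entries
		}
		if err != nil {
			return err
		}
		for _, msg := range streams[0].Messages {
			id, _ := msg.Values["id"].(string)
			payload, _ := msg.Values["payload"].(string)
			if err := w.WriteMessages(ctx, kafka.Message{
				Key:   []byte(id),
				Value: []byte(payload),
			}); err != nil {
				// Retryable failure: leave the entry pending so this
				// replica, or another via XAUTOCLAIM, retries it.
				continue
			}
			// Acknowledge only after Kafka accepted the write.
			rdb.XAck(ctx, "events:stream", "kafka-delivery", msg.ID)
		}
	}
}

// Reclaim entries left pending by crashed or stalled replicas.
func reclaim(ctx context.Context, rdb *redis.Client, consumer string) ([]redis.XMessage, error) {
	msgs, _, err := rdb.XAutoClaim(ctx, &redis.XAutoClaimArgs{
		Stream:   "events:stream",
		Group:    "kafka-delivery",
		Consumer: consumer,
		MinIdle:  time.Minute, // pending longer than this is presumed orphaned
		Start:    "0-0",
		Count:    100,
	}).Result()
	return msgs, err
}
```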

Durable Fallback: PostgreSQL

When Redis becomes unhealthy, events are synchronously persisted to PostgreSQL. A circuit breaker bypasses Redis to avoid cascading failures.

Once Redis recovers, a coordinated drain process moves events back into Redis for normal processing.
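A sketch of the failover write, assuming the sony/gobreaker library and a hypothetical event_fallback table; appendEvent is the fast-path sketch from earlier. When the breaker is open, Execute fails immediately without touching Redis:

```go
package events

import (
	"context"
	"database/sql"

	"github.com/redis/go-redis/v9"
	"github.com/sony/gobreaker"
)

// Redis-first write behind a circuit breaker; while the breaker is
// open (Redis deemed unhealthy), persist synchronously to PostgreSQL
// instead.
func persistEvent(ctx context.Context, cb *gobreaker.CircuitBreaker,
	rdb *redis.Client, db *sql.DB, eventID string, payload []byte) error {

	_, err := cb.Execute(func() (interface{}, error) {
		return nil, appendEvent(ctx, rdb, eventID, payload)
	})
	if err == nil {
		return nil
	}
	// Fallback: durable insert; the drain process later moves these
	// rows back into the Redis stream once the breaker closes.
	_, dbErr := db.ExecContext(ctx,
		`INSERT INTO event_fallback (event_id, payload, created_at)
		 VALUES ($1, $2, now())
		 ON CONFLICT (event_id) DO NOTHING`,
		eventID, payload)
	return dbErr // if this also fails, the event must not be acknowledged
}
```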

Dead Letter Queue (DLQ)

Non-retryable failures are routed to a Dead Letter Queue, preserving error context and preventing poison messages from blocking the pipeline.
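A sketch of DLQ routing, reusing the same go-redis client; the events:dlq stream and the recorded fields are illustrative:

```go
package events

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Route a poison message to a DLQ stream, preserving the original
// payload and error context, then acknowledge it so it no longer
// blocks the pipeline.
func sendToDLQ(ctx context.Context, rdb *redis.Client, msg redis.XMessage, cause error) error {
	if err := rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: "events:dlq",
		Values: map[string]interface{}{
			"original_id": msg.ID,
			"payload":     msg.Values["payload"],
			"error":       cause.Error(),
			"failed_at":   time.Now().UTC().Format(time.RFC3339),
		},
	}).Err(); err != nil {
		return err
	}
	return rdb.XAck(ctx, "events:stream", "kafka-delivery", msg.ID).Err()
}
```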

Consistency Guarantees

  • At-least-once delivery to Kafka
  • No silent event loss
  • Eventual delivery under partial failures

Deferred Complexity

Exactly-once semantics were avoided due to coordination overhead. Idempotent consumers provide sufficient safety.

Cross-stream ordering and per-event disk durability were deferred to preserve throughput and reduce hot-path latency.

Why This Is Still Not 100% Guaranteed

Despite multi-tier buffering, this system does not provide absolute delivery guarantees. There remains a narrow but real failure window where both Redis and PostgreSQL are unavailable simultaneously.

In such a scenario, the gateway loses its ability to persist events before request completion. This is an explicit and acknowledged risk, not an oversight.

The design intentionally treats Redis as the primary durability layer. On a constrained infrastructure budget, investing in high availability for Redis yields the largest improvement in overall SLA, while treating PostgreSQL as a secondary safety net rather than a continuously replicated ledger.

Redis as the System Heart

Introducing Redis Streams shifts Redis from a cache-like role into a critical system dependency. It becomes the coordination point for ingestion, ordering, retries, and recovery.

This is a deliberate tradeoff. Redis is memory-backed, fast under contention, and operationally simpler to harden with replication and failover than placing disk-heavy Kafka coordination on the request path.

The architecture optimizes for high availability of Redis rather than theoretical fault tolerance across all storage layers.

Consumption Idempotency

The pipeline guarantees at-least-once delivery. Duplicate delivery is therefore expected and must be handled by consumers.

Consumers are designed to be idempotent, using deterministic event identifiers rather than relying on Kafka offsets or delivery ordering.

This pushes correctness to the edge of the system, where business semantics are known, rather than attempting to enforce exactly-once behavior in the transport layer.
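For illustration, a deterministic ID might be derived like this; the identity fields chosen here are assumptions, not the actual scheme:

```go
package events

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// One way to derive a deterministic event ID: hash the fields that
// define the event's identity, so retries of the same logical event
// always map to the same ID regardless of offsets or ordering.
func deterministicEventID(requestID, eventType string, seq int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s|%s|%d", requestID, eventType, seq)))
	return hex.EncodeToString(sum[:])
}
```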

Planned Inbox-Based Deduplication

To strengthen consumer-side guarantees, an inbox-style deduplication mechanism is planned.

Each consumer will maintain a lightweight inbox keyed by event ID, recording delivery time and an expected expiration window. Incoming events are first checked against this inbox before processing.

If an event ID is already present and marked complete, the event is discarded. Otherwise, it proceeds to processing and is marked as completed afterward.

This approach reduces duplicate side effects without relying on Kafka exactly-once semantics or transactional producers.
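A sketch of the planned flow, modeling the inbox as Redis keys with a TTL as the expiration window; a relational inbox table would follow the same shape. All names and the 24-hour window are assumptions:

```go
package events

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Inbox check-then-mark: first delivery claims a "pending" marker,
// duplicates are checked against it, and completion is recorded
// only after processing succeeds.
func handleWithInbox(ctx context.Context, rdb *redis.Client, eventID string, process func() error) error {
	// First delivery claims the marker; duplicates see ok == false.
	ok, err := rdb.SetNX(ctx, "inbox:"+eventID, "pending", 24*time.Hour).Result()
	if err != nil {
		return err
	}
	if !ok {
		state, _ := rdb.Get(ctx, "inbox:"+eventID).Result()
		if state == "done" {
			return nil // already processed and marked: discard
		}
		// Still "pending": an earlier delivery may have crashed
		// mid-flight; fall through and reprocess.
	}
	if err := process(); err != nil {
		return err
	}
	// Post-processing marking: the consistency gap discussed in the
	// next section sits between process() succeeding and this write.
	return rdb.Set(ctx, "inbox:"+eventID, "done", 24*time.Hour).Err()
}
```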

Known Weakness: Post-Processing Marking

The inbox pattern introduces its own failure mode: if event processing completes successfully but the completion marker fails to persist, the system may reprocess the same event.

This creates a narrow consistency gap where side effects have occurred but the deduplication record is missing.

Eliminating this gap entirely would require transactional coupling between business side effects and inbox state updates, reintroducing coordination costs that this architecture intentionally avoids.

Tradeoffs Accepted

  • Memory over disk: Redis is treated as the primary ingestion layer to keep the request path fast, accepting higher reliance on Redis availability.
  • At-least-once semantics: Duplicate delivery is possible and pushed to consumer-side idempotency rather than enforcing exactly-once at the transport layer.
  • Bounded durability: A narrow failure window exists if both Redis and PostgreSQL are unavailable simultaneously.
  • Operational complexity: Multiple recovery paths (Redis, PostgreSQL, DLQ) increase internal complexity in exchange for reduced blast radius and better overall SLA.

What I'd Change at Scale

At higher scale, or under stricter compliance requirements, the event pipeline would evolve toward stronger durability and clearer ownership boundaries.

  • Sharded Redis Streams to reduce memory pressure and single-stream contention
  • Adaptive batching based on Kafka broker health and consumer lag
  • Inbox-based deduplication with tighter coupling between business side effects and completion markers
  • Append-only event ledger for regulatory-grade auditability and replay guarantees