Rate Limiting & Quota Enforcement

TL;DR

Rate limiting and quota enforcement are performed at request admission using centralized state, allowing distributed gateway instances to make consistent allow/deny decisions. Enforcement is treated as an economic safety boundary: requests are evaluated against client subscription limits and upstream provider constraints before forwarding. Fixed-window counters are used to balance correctness with operational simplicity, explicitly accepting some burstiness while prioritizing conservative enforcement over the risk of provider limit violations.

Enforcement Invariant

A request must never be forwarded if doing so would exceed either the client’s subscription limits or an upstream provider’s safety constraints. When enforcement state is ambiguous or stale, the system rejects the request rather than risk over-consumption.
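
The invariant can be read as a fail-closed admission check. The sketch below is illustrative only: QuotaSnapshot, MAX_STALENESS_S, and may_forward are hypothetical names, and the one-second freshness bound is an assumption rather than a value taken from this design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaSnapshot:
    used: int          # units already consumed in the current window
    limit: int         # maximum units allowed in the window
    fetched_at: float  # time the snapshot was read, in seconds


# Assumed freshness bound; the real threshold is a tuning decision.
MAX_STALENESS_S = 1.0


def may_forward(snapshot: Optional[QuotaSnapshot], cost: int, now: float) -> bool:
    """Return True only when state is present, fresh, and the request fits."""
    if snapshot is None:
        return False  # enforcement state unavailable: reject
    if now - snapshot.fetched_at > MAX_STALENESS_S:
        return False  # enforcement state stale: reject
    return snapshot.used + cost <= snapshot.limit
```

Every uncertain branch returns False, so an outage of the enforcement state degrades into rejected requests rather than over-consumption.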

Problem

The gateway must enforce both client-facing usage limits and upstream provider constraints while operating as a distributed system with multiple gateway instances.

These limits are economic and contractual in nature. Exceeding provider limits risks financial loss and contractual breach with upstream vendors, while incorrect client enforcement breaks subscription guarantees and billing correctness.

Why Naïve Approaches Fail

Per-instance, in-memory rate limiting cannot provide correct global enforcement once traffic is distributed across multiple gateway nodes.
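
A small simulation makes the failure concrete. It assumes three gateway instances behind a round-robin load balancer, each holding its own counter with a limit of 100; LocalLimiter and simulate are hypothetical names used only for this illustration.

```python
class LocalLimiter:
    """In-memory counter private to one gateway instance."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.count = 0

    def allow(self) -> bool:
        if self.count < self.limit:
            self.count += 1
            return True
        return False


def simulate(instances: int, limit: int, requests: int) -> int:
    """Spread requests round-robin across instances; return the number admitted."""
    limiters = [LocalLimiter(limit) for _ in range(instances)]
    admitted = 0
    for i in range(requests):
        if limiters[i % instances].allow():
            admitted += 1
    return admitted


print(simulate(instances=3, limit=100, requests=1000))  # 300
```

The intended global limit was 100, but the fleet admits 300: the effective limit scales with the number of instances, which changes whenever the gateway is scaled.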

Delegating enforcement to downstream services breaks atomicity between request admission, quota tracking, and billing decisions, creating inconsistencies under load or partial failure.

Solution

Rate limiting is treated as part of request admission rather than an after-the-fact control. Each incoming request is evaluated against client subscription allowances and provider safety limits before being forwarded.

Enforcement state is logically centralized, allowing all gateway instances to make consistent allow/deny decisions without embedding billing or provider logic into downstream services.
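
One way to picture the admission step is a single all-or-nothing check against both limits. In the sketch below, SharedCounterStore is an in-process stand-in for the centralized state (the actual backing store is not specified here), and the key layout and the admit signature are assumptions made for illustration.

```python
import threading
from typing import Dict


class SharedCounterStore:
    """In-process stand-in for the centralized enforcement state (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Dict[str, int] = {}

    def try_consume_all(self, limits: Dict[str, int], cost: int = 1) -> bool:
        """Consume `cost` from every key, or from none of them."""
        with self._lock:
            # First pass: verify every limit before mutating anything.
            for key, limit in limits.items():
                if self._counts.get(key, 0) + cost > limit:
                    return False
            # Second pass: all checks passed, so commit every increment.
            for key in limits:
                self._counts[key] = self._counts.get(key, 0) + cost
            return True


def admit(store: SharedCounterStore, client_id: str, provider_id: str,
          window: int, client_limit: int, provider_limit: int) -> bool:
    """Admit a request only if it fits both the client and provider limits."""
    limits = {
        f"client:{client_id}:{window}": client_limit,
        f"provider:{provider_id}:{window}": provider_limit,
    }
    return store.try_consume_all(limits)
```

Checking both limits before committing either increment means a request rejected by the provider limit does not silently consume the client's quota.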

Constraints

Hot-path latency: Enforcement runs on every request and must not dominate request handling time.

Safety guarantees: Provider limits must never be exceeded, even under bursty client traffic.

Operational simplicity: Enforcement must remain understandable and debuggable for a small team.

Deferred Complexity

Problem: Enforcement uses fixed-window counters rather than a fully continuous refill model (such as a token bucket).

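This allows bursts at window boundaries: a client can exhaust one window's allowance just before the boundary and the next window's allowance just after it, briefly admitting up to twice the configured limit. Eliminating this entirely would require more complex state tracking and coordination, which was not justified given observed client behavior.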
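
The boundary effect is easy to reproduce with a minimal fixed-window counter. The sketch below is in-memory and single-process; FixedWindowLimiter, the limit of 100, and the 60-second window are illustrative assumptions, not the production configuration.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class FixedWindowLimiter:
    """Counts requests per (key, window index); old windows are never evicted here."""

    def __init__(self, limit: int, window_s: int) -> None:
        self.limit = limit
        self.window_s = window_s
        self._counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, key: str, now: float) -> bool:
        window = int(now // self.window_s)  # index of the window containing `now`
        bucket = (key, window)
        if self._counts[bucket] >= self.limit:
            return False
        self._counts[bucket] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_s=60)
late = sum(limiter.allow("client-a", 59.9) for _ in range(150))   # end of window 0
early = sum(limiter.allow("client-a", 60.1) for _ in range(150))  # start of window 1
print(late, early)  # 100 100
```

Each window stays within its limit, yet 200 requests are admitted within roughly 0.2 seconds around the boundary.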

Tradeoffs Accepted

Conservative enforcement: The system prefers rejecting borderline requests over risking provider limit violations.

Simpler state models: Fixed windows reduce storage and coordination cost at the expense of precise smoothing.

What I'd Change at Scale

Introduce sliding-window or hybrid enforcement models once client behavior or request volume justifies the added complexity, along with stronger observability into quota consumption patterns.
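
As one possible direction, a sliding-window counter approximates a rolling window by weighting the previous fixed window's count by how much of it still overlaps the rolling window. This is a sketch of that general technique, not a committed design; SlidingWindowCounter and its parameters are hypothetical.

```python
from typing import Dict, Tuple


class SlidingWindowCounter:
    """Approximate sliding window built from two adjacent fixed-window counts."""

    def __init__(self, limit: int, window_s: int) -> None:
        self.limit = limit
        self.window_s = window_s
        self._counts: Dict[Tuple[str, int], int] = {}

    def allow(self, key: str, now: float) -> bool:
        window = int(now // self.window_s)
        # Fraction of the current fixed window that has already elapsed.
        elapsed = (now % self.window_s) / self.window_s
        prev = self._counts.get((key, window - 1), 0)
        curr = self._counts.get((key, window), 0)
        # The previous window contributes in proportion to its remaining overlap.
        estimated = prev * (1.0 - elapsed) + curr
        if estimated + 1 > self.limit:
            return False
        self._counts[(key, window)] = curr + 1
        return True


limiter = SlidingWindowCounter(limit=100, window_s=60)
late = sum(limiter.allow("client-a", 59.9) for _ in range(150))   # end of window 0
early = sum(limiter.allow("client-a", 60.1) for _ in range(150))  # start of window 1
print(late, early)  # 100 0
```

The same boundary burst that a fixed-window counter admits twice over is admitted only once here, at the cost of reading two counters per decision and accepting an estimate rather than an exact count.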