Performance
The Gateway sits on the request path, so its overhead has to be small enough to ignore. We will publish numbers only with full context: hardware, payload sizes, concurrency, and provider.
What we measure
Latency overhead
Added latency compared with calling the provider directly, measured at P50 and P99 for both streaming and non-streaming requests.
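As a rough illustration of how the overhead figure could be derived from two sets of latency samples, here is a minimal sketch. The function names (`percentile`, `latency_overhead`) and the nearest-rank percentile method are assumptions made for this example, not a description of the actual harness.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_overhead(direct_ms, gateway_ms):
    """Added latency (Gateway minus direct) at P50 and P99, in milliseconds.

    Hypothetical helper: both arguments are lists of per-request latencies
    collected under the same load, one set per path.
    """
    return {
        f"p{pct}": percentile(gateway_ms, pct) - percentile(direct_ms, pct)
        for pct in (50, 99)
    }
```

Note that this compares the percentiles of two distributions. It is not the percentile of per-request differences, which would require pairing each Gateway request with an identical direct request.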
Throughput
The highest request rate a single node can sustain, in requests per second, before latency degrades.
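One way to turn a load sweep into a single throughput number is sketched below. It assumes "degrades" means P99 latency exceeding a fixed budget. That threshold, the function name `sustained_rps`, and the shape of the sweep data are all assumptions for illustration, since the document does not define the cutoff.

```python
def sustained_rps(sweep, p99_budget_ms):
    """Highest request rate reached before P99 latency first exceeds the budget.

    sweep: list of (requests_per_second, p99_latency_ms) pairs, one per load step.
    p99_budget_ms: assumed definition of "degraded" (hypothetical threshold).
    Returns None if even the lowest rate is over budget.
    """
    best = None
    for rps, p99_ms in sorted(sweep):
        if p99_ms > p99_budget_ms:
            break  # stop at the first degraded step; higher rates do not count
        best = rps
    return best
```

Stopping at the first step over budget, instead of taking the maximum passing rate, avoids reporting a rate that only passed because of a noisy measurement above the knee.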
Streaming pass-through
Time to first token compared with a direct provider call, to confirm the Gateway does not buffer the stream.
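The measurement itself can be sketched as a timer around the first non-empty chunk of a stream. In the sketch below, `open_stream` is a hypothetical callable that issues the request and returns an iterator of chunks. It stands in for whatever client the harness ends up using, and the injectable `clock` exists only to make the function testable.

```python
import time


def time_to_first_token(open_stream, clock=time.perf_counter):
    """Seconds from issuing the request to receiving the first non-empty chunk.

    open_stream: hypothetical zero-argument callable that sends the request
                 and returns an iterable of response chunks.
    Returns None if the stream ends without producing any content.
    """
    start = clock()
    for chunk in open_stream():
        if chunk:
            return clock() - start
    return None
```

Running this once against the provider and once through the Gateway with the same prompt gives two values to compare. If the Gateway buffered the stream, its time to first token would approach the full response time instead of tracking the direct call.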
Methodology
Each published figure will ship with a reproducible test and the context it ran in: the node size, the request mix, the concurrency level, the model and provider used, and whether caching was enabled. Numbers without that context are not useful, so we will not post them until the harness and the runtime are stable.
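To make the required context concrete, the sketch below bundles it into a record that travels with every figure. The class name `BenchmarkContext`, the field names, and the JSON layout are assumptions for this example. Only the list of fields comes from the methodology above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BenchmarkContext:
    """Context that must accompany a published figure (hypothetical schema)."""
    node_size: str         # e.g. an instance type or CPU/memory description
    request_mix: dict      # share of each request type, e.g. streaming vs non-streaming
    concurrency: int       # number of in-flight requests during the run
    model: str
    provider: str
    caching_enabled: bool


def to_report(context, figures):
    """Serialize figures together with their context so neither is published alone."""
    if not isinstance(context, BenchmarkContext):
        raise TypeError("figures cannot be reported without a BenchmarkContext")
    return json.dumps({"context": asdict(context), "figures": figures}, sort_keys=True)
```

Making the context a required argument means a figure cannot be serialized without it, which mirrors the rule that numbers are not posted without their test conditions.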