Kubernetes Gateway API Inference Extension: What It Is and Why It Matters
If you are running LLM workloads on Kubernetes today, you have almost certainly hit the same wall. Your standard Ingress or Gateway routes traffic to pods based on HTTP path or headers. Round-robin distributes load across replicas. Everything looks fine from the control plane perspective. But your GPU utilization is uneven, some replicas are getting hammered while others sit idle, latency is unpredictable, and you have no way to differentiate a low-latency interactive chat request from a background batch job that can afford to wait.
The Kubernetes Gateway API Inference Extension is the official answer to this problem. First announced at KubeCon EU 2025 and driven by WG-Serving and SIG-Network, it extends the Gateway API with inference-aware routing primitives that standard load balancers were never designed to handle.
This blog explains what it is, how it works under the hood, and what it means operationally for platform teams running self-hosted models.
Why Standard Load Balancing Fails for LLM Inference
Traditional L7 routing works on a simple principle: look at the request attributes (path, headers, method), pick a backend using a load balancing algorithm (round-robin, least-connections, ring-hash), and forward. This works well for stateless, short-lived requests where every replica is functionally identical and equally capable of handling any request at any time.
LLM inference breaks all of those assumptions.
Replicas are not equally capable at any given moment. A vLLM replica serving a large batch may have 90% of its KV cache occupied and be close to memory limits. Another replica with a lighter load has plenty of headroom. Round-robin sending a new long-context request to the first replica will cause KV cache eviction, latency spikes, or an OOM. Sending it to the second replica is fine.
Requests are not stateless. Prompt caching means a replica that has already processed a common system prompt prefix has that prefix cached in its KV cache. Routing the same prefix to a different replica wastes the cache and adds latency. The router needs to know which replica has the relevant cache before making a routing decision.
Not all requests have equal urgency. An interactive user waiting for a response needs sub-second time-to-first-token. A background summarization job running overnight can afford to wait in a queue. Standard gateways have no concept of request criticality; they treat every request the same.
Model identity is embedded in the request body, not the headers. The OpenAI-compatible API carries the model name in the JSON body as "model": "llama-3-70b". Standard HTTP routers cannot inspect the body to make routing decisions without extension logic.
These gaps were being patched with ad-hoc solutions everywhere: custom sidecars, bespoke routing logic written into application code, proprietary load balancers that only worked with specific inference servers. The Gateway API Inference Extension standardizes this at the Kubernetes layer.
How It Works
The extension uses Envoy’s External Processing (ext-proc) mechanism. Ext-proc allows an external gRPC service to intercept requests at the proxy layer, inspect or modify them, and make routing decisions before the proxy forwards them to a backend. Any gateway that supports both ext-proc and the Gateway API can be extended into an Inference Gateway.
This is the key architectural decision: the extension does not require a new proxy or a new control plane. It extends existing gateways like Envoy Gateway, kgateway, GKE Gateway, and NGINX Gateway Fabric by adding an inference-aware decision layer on top of what you already have.
The three main components involved are:
Inference Gateway (IGW): The proxy layer, an existing Gateway API implementation extended with ext-proc to become inference-aware. It handles the actual traffic forwarding.
Inference Scheduler / Endpoint Picker (EPP): The decision-making component. When a request arrives, the gateway passes it to the EPP over ext-proc. The EPP inspects the request body, reads real-time metrics from model servers, applies routing logic, and returns a routing decision to the gateway. The EPP is pluggable; you can replace or extend it with custom logic.
Model Server Metrics: The EPP reads live metrics from inference servers such as KV cache utilization, active request count, queue depth, available LoRA adapters, and prefix cache hit status. These are the signals that make routing decisions inference-aware rather than generic.
The request flow looks like this:
Client Request
|
v
Inference Gateway (Envoy / kgateway / GKE Gateway)
|
| ext-proc call (gRPC)
v
Endpoint Picker (EPP)
- Parse model name from request body
- Read live metrics from model server pods
- Apply routing filters (criticality, cache affinity, load)
- Return target endpoint
|
v
Gateway forwards to selected vLLM / TGI pod
The New API Resources
The extension introduces two new CRDs alongside the existing Gateway API resources.
InferencePool
InferencePool replaces Service as the backend reference for LLM traffic. It represents a group of pods that share the same base model, accelerator type, and model server configuration. Think of it as a typed backend that the gateway knows is an inference workload rather than a generic HTTP service.
apiVersion: inference.networking.k8s.io/v1alpha2
kind: InferencePool
metadata:
name: llama3-70b-pool
spec:
selector:
matchLabels:
app: vllm-llama3-70b
targetPortNumber: 8000
extensionRef:
name: llama3-endpoint-picker
Within an HTTPRoute, you reference an InferencePool the same way you would reference a Service in a backendRef. The gateway knows to apply inference-specific routing logic when the backend is an InferencePool rather than a plain Service.
InferenceObjective (formerly InferenceModel)
InferenceObjective defines the routing objective for a named model as clients see it. It maps a client-facing model name to a backend model deployment and specifies the criticality of requests using that model name.
apiVersion: inference.networking.k8s.io/v1alpha2
kind: InferenceObjective
metadata:
name: llama3-chat
spec:
modelName: llama3-chat
criticality: Critical
poolRef:
name: llama3-70b-pool
When the EPP sees a request with "model": "llama3-chat", it looks up the corresponding InferenceObjective, reads the criticality level, and factors that into the routing decision alongside real-time metrics.
Routing Intelligence: What the EPP Actually Does
The Endpoint Picker applies a sequence of filters to select the best pod for each request. Understanding these filters is important for operational tuning.
Model name resolution. The EPP parses the request body and extracts the model name. It matches this against InferenceObjective resources to find the target InferencePool and routing configuration. This is how the extension routes by model identity rather than just HTTP path.
Criticality-based load shedding. Requests marked as Sheddable (batch jobs, background processing) can be dropped or queued when the system is under load. Requests marked as Critical (interactive sessions) are protected. This is the mechanism that prevents a batch job from consuming the last available GPU slot before an interactive user’s request arrives.
KV cache affinity. If a model server pod has a cached prefix that matches the incoming request, the EPP routes to that pod to get a cache hit. This reduces time-to-first-token for requests with shared prefixes such as long system prompts. The EPP reads prefix cache status from model server metrics to make this decision.
Load-aware selection. Among pods that pass the above filters, the EPP selects based on current load indicators: KV cache utilization, active request count, and queue depth. This is the “smart” load balancing that standard round-robin cannot do.
What This Enables Operationally
Model Rollouts Without Custom Scripts
Because routing is model-name-aware, you can split traffic between model versions declaratively. An HTTPRoute can send 90% of llama3-chat requests to one InferencePool running the current model version and 10% to another pool running a candidate version. This is A/B testing and canary deployment for LLM models, done the same way you would do it for any other Kubernetes service.
rules:
- backendRefs:
- group: inference.networking.k8s.io
kind: InferencePool
name: llama3-70b-v1
weight: 90
- group: inference.networking.k8s.io
kind: InferencePool
name: llama3-70b-v2
weight: 10
LoRA Adapter Routing
A single base model deployment running on an InferencePool can serve multiple LoRA adapters loaded into the same model server. The EPP can route requests to the specific pod that has the requested adapter loaded, avoiding the latency cost of loading it on demand.
This matters because loading a LoRA adapter into a running vLLM instance is not free. If you have ten adapters and ten pods and adapters are not distributed uniformly, routing without adapter awareness will cause frequent adapter swaps. The EPP eliminates this by routing to a pod that already has the adapter resident.
Multi-Tenant Platform Operations
The criticality and load-shedding mechanism is the foundation for multi-tenant inference platforms. A platform team can serve multiple teams from the same InferencePool, with different criticality tiers for each team’s requests. Internal development workloads get Sheddable priority. Production user-facing workloads get Critical priority. The gateway enforces these boundaries without requiring the platform team to run separate GPU pools for each team.
Observability Around Service Objectives
The extension adds end-to-end observability around whether inference workloads are meeting their objectives. You can track per-model metrics such as whether Critical requests are being served within target latency, what percentage of Sheddable requests are being dropped under load, and what the EPP’s routing decisions look like over time.
Current Implementation Support
The extension works by extending existing gateways, so support depends on which gateways have implemented ext-proc plus the inference extension integration:
- Envoy Gateway: Supported
- kgateway (formerly Gloo Gateway): Supported
- GKE Gateway: Supported (managed, on GKE)
- NGINX Gateway Fabric: Supported as of v2.6.3
- Alibaba Cloud ACK Gateway: Supported
- Istio: In progress, tracked via GitHub issue
The EPP and associated scheduling APIs have also been moved to the llm-d project, a CNCF-donated initiative from IBM, Red Hat, and Google Cloud that implements disaggregated serving by splitting prefill and decode phases across separate pod pools. The Gateway API Inference Extension and llm-d are designed to work together.
What Is Not There Yet
This is worth being direct about. As of mid-2026, the extension is still experimental and not recommended for production use without careful evaluation.
HPA integration is not complete. Autoscaling based on aggregate inference metrics derived from the load balancer is on the roadmap but not yet shipped. You cannot currently trigger a scale-out based on EPP-observed queue depth or KV cache pressure across the pool.
Heterogeneous accelerator support is in progress. Routing across pools with different GPU types using latency and cost-aware logic is planned but not fully implemented.
Prefix-cache aware load balancing with remote caches is roadmapped but not yet supported. Currently, KV cache affinity only works for caches resident in individual pod memory.
The EPP and InferenceObjective APIs have recently moved to llm-d. If you built on earlier versions of the extension, some of the component boundaries have shifted.
Should You Use It Now?
If you are building a new self-hosted LLM serving platform on Kubernetes and your gateway already supports ext-proc, yes, you should evaluate it. The two CRDs are not a large adoption surface. The routing improvements for prefix-cache affinity and criticality-based load shedding are immediately practical. Model rollout via traffic splitting is directly useful and does not require any new operational complexity.
If you are running production traffic today with hard latency SLOs, hold off until the HPA integration matures. The lack of autoscaling based on inference-native metrics means you are still patching that part with external solutions.
If you are on a managed Kubernetes platform like GKE, the integration is cleaner because the gateway implementation is handled for you. The friction of setting up ext-proc and the EPP is mostly abstracted away.
The Bigger Picture
The Gateway API Inference Extension is part of a broader shift in how Kubernetes handles AI workloads. The same KubeCon EU 2026 cycle that saw llm-d donated to CNCF also saw NVIDIA’s Dynamic Resource Allocation driver donated, enabling fine-grained GPU sharing at the hardware level. The pieces are being assembled: smarter scheduling, disaggregated serving, inference-aware routing, and GPU resource management are all converging in the Kubernetes ecosystem.
The Gateway API Inference Extension is the networking layer of that stack. It gives the control plane visibility into what is actually happening inside model servers and uses that visibility to make routing decisions that reduce latency, improve GPU utilization, and enforce operational boundaries between tenants.
For SREs and platform engineers running self-hosted models, this is the first time these capabilities have had a standardized, Kubernetes-native API. Worth tracking closely even if you are not ready to run it in production today.
Project repository: https://github.com/kubernetes-sigs/gateway-api-inference-extension Official docs: https://gateway-api-inference-extension.sigs.k8s.io