Multi-LLM routing on your own Kubernetes
What it actually means to give your applications one OpenAI-compatible endpoint and route across 17 providers behind it — without writing a microservice.
Mar 18, 2026 · 2 min read · Mehmet Karaca, Platform Engineer · Engineering
Tags: #ai-gateway · #llm · #routing · #kubernetes
Most applications writing against an LLM today pin one provider. The choice happens early — usually around the first prototype — and then it gets hard to change. Pricing shifts, latency drifts, a new model comes out, the regulator asks where prompts are going, and the application team rebuilds the integration.
Apinizer's AI Gateway gives you one OpenAI-compatible endpoint your applications speak to, and routes the request across 17 providers behind the gateway. Switching is a manifest change, not a sprint.
The shape of the contract
Applications keep calling POST /v1/chat/completions against the
gateway. The gateway is responsible for picking a provider, translating
the request to that provider's native format, and translating the
response back.
# routes.yaml — APIops manifest
ai_routes:
- name: chat-mid-tier
match:
model: "gpt-4o-mini"
targets:
- provider: anthropic
model: claude-haiku-4-5
weight: 60
max_latency_ms: 800
- provider: openai
model: gpt-4o-mini
weight: 40
max_latency_ms: 800
fallback: ollama/llama-3.1
The application doesn't change. The route does.
What "OpenAI-compatible facade" actually means
It means three things:
- The request shape coming into the gateway is OpenAI's.
- The response shape going back is OpenAI's.
- Streaming works — including SSE chunked responses.
Anthropic's messages field maps to OpenAI's messages. Bedrock's
inferenceConfig maps to OpenAI's temperature / top_p. Gemini's
safetySettings get filled from the Apinizer policy chain. The gateway
handles all of this; the application stays on the OpenAI SDK it already
uses.
Quotas and audit
Every request goes through the same MessageContext your REST traffic
uses. Per-credential quotas are enforced before the provider call. The
audit trail captures the prompt, the chosen route, the provider, the
token count, and the response time — alongside REST audit data, in the
same Elasticsearch index.
Your operators don't need a second observability stack for AI traffic. The Analytics Engine they're already running picks it up.
Failover that doesn't lose traffic
Failover is policy-driven, not retry-on-error. If the chosen provider returns a 5xx, exceeds the latency target, or hits a quota ceiling, the next provider in the route runs. The application sees one response shape, one timeout, one log line.
Self-hosted fallbacks (vLLM, Ollama, Llama) are first-class — many regulated customers run a self-hosted "last-resort" provider so traffic never leaves the cluster, even if every external provider is down.
What's next
The route format above is stable for 2026.04. We're working on semantic routing — choose a provider per-prompt based on prompt classification, not just per-route — for 2026.09. If you have a use case that wants per-prompt routing, ping us.
All posts · Book a Demo · Read the docs
Links
- Products: https://apinizer.com/products
- AI Gateway: https://apinizer.com/products/ai-gateway
- Solutions: https://apinizer.com/solutions
- Pricing: https://apinizer.com/pricing
- Developers: https://apinizer.com/developers
- Documentation: https://docs.apinizer.com/index-en
- Blog: https://apinizer.com/blog
- Contact: https://apinizer.com/company/contact
© 2026 Apinizer. All rights reserved.