Multi-LLM routing on your own Kubernetes

What it actually means to give your applications one OpenAI-compatible endpoint and route across 17 providers behind it — without writing a microservice.

Mar 18, 2026 · 2 min read · Mehmet Karaca, Platform Engineer · Engineering

Tags: #ai-gateway · #llm · #routing · #kubernetes

Most applications writing against an LLM today pin one provider. The choice happens early — usually around the first prototype — and then it gets hard to change. Pricing shifts, latency drifts, a new model comes out, the regulator asks where prompts are going, and the application team rebuilds the integration.

Apinizer's AI Gateway gives you one OpenAI-compatible endpoint your applications speak to, and routes the request across 17 providers behind the gateway. Switching is a manifest change, not a sprint.

The shape of the contract

Applications keep calling POST /v1/chat/completions against the gateway. The gateway is responsible for picking a provider, translating the request to that provider's native format, and translating the response back.

# routes.yaml — APIops manifest
ai_routes:
  - name: chat-mid-tier
    match:
      model: "gpt-4o-mini"
    targets:
      - provider: anthropic
        model: claude-haiku-4-5
        weight: 60
        max_latency_ms: 800
      - provider: openai
        model: gpt-4o-mini
        weight: 40
        max_latency_ms: 800
    fallback: ollama/llama-3.1

The application doesn't change. The route does.

What "OpenAI-compatible facade" actually means

It means three things:

The request shape coming into the gateway is OpenAI's.
The response shape going back is OpenAI's.
Streaming works — including SSE chunked responses.

Anthropic's messages field maps to OpenAI's messages. Bedrock's inferenceConfig maps to OpenAI's temperature / top_p. Gemini's safetySettings get filled from the Apinizer policy chain. The gateway handles all of this; the application stays on the OpenAI SDK it already uses.

Quotas and audit

Every request goes through the same MessageContext your REST traffic uses. Per-credential quotas are enforced before the provider call. The audit trail captures the prompt, the chosen route, the provider, the token count, and the response time — alongside REST audit data, in the same Elasticsearch index.

Your operators don't need a second observability stack for AI traffic. The Analytics Engine they're already running picks it up.

Failover that doesn't lose traffic

Failover is policy-driven, not retry-on-error. If the chosen provider returns a 5xx, exceeds the latency target, or hits a quota ceiling, the next provider in the route runs. The application sees one response shape, one timeout, one log line.

Self-hosted fallbacks (vLLM, Ollama, Llama) are first-class — many regulated customers run a self-hosted "last-resort" provider so traffic never leaves the cluster, even if every external provider is down.

What's next

The route format above is stable for 2026.04. We're working on semantic routing — choose a provider per-prompt based on prompt classification, not just per-route — for 2026.09. If you have a use case that wants per-prompt routing, ping us.

All posts · Book a Demo · Read the docs