Capability-based routing
Tag prompts by intent. Route summarization to a small model, code generation to a code specialist, and vision to a multimodal model, all without changing the application.
Apinizer's AI Gateway routes every prompt by cost, latency, capability, and policy. Frontier model when needed; regional when sufficient; open-weight when sovereign — all under one identity, one audit, one runtime.
The problem
Services pick a model in a sprint and inherit its bill, its rate limit, its sovereignty profile, and its outages forever. When the provider has an incident, the app does too. When a cheaper model would serve, the team pays the frontier price anyway. Apinizer turns the model into a policy decision — per call, not per service.
Capabilities
Tag prompts by intent. Route summarization to a small model, code generation to a code specialist, and vision to a multimodal model, all without changing the application.
Free models first; paid only on fallback. Frontier providers only on intents that need them. The gateway picks the cheapest sufficient model per call.
Provider hiccup? Traffic rolls to the next provider in the pool with the same capability profile. The application never learns there was an incident.
Personal data routes only to providers in approved jurisdictions. The policy is data, not code; the rule applies to every call automatically.
Try a new model on 5% of traffic with the same auth and audit. Compare cost, latency, and quality side by side before flipping the default.
Local Llama, Mistral, and Qwen deployments live in the routing pool alongside hosted providers. The application doesn't choose; the policy does.
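To make the capabilities above concrete, here is a minimal Python sketch of how capability matching, cost ordering, jurisdiction filtering, and failover might compose. It is an illustration only, not Apinizer's implementation or configuration format: the `Model` fields, the pool entries, the prices, and the `route` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    # Hypothetical pool entry; field names are assumptions for this sketch.
    name: str
    capabilities: frozenset
    cost_per_1k_tokens: float  # 0.0 stands for a free model
    jurisdiction: str
    healthy: bool = True


# Illustrative pool; names, prices, and jurisdictions are made up.
POOL = [
    Model("small-summarizer", frozenset({"summarization"}), 0.0, "TR"),
    Model("code-specialist", frozenset({"code"}), 0.4, "EU"),
    Model("frontier-general", frozenset({"summarization", "code", "vision"}), 3.0, "US"),
]


def candidates(intent, allowed_jurisdictions=None, pool=POOL):
    """Return healthy models that can serve the intent, cheapest first."""
    eligible = [
        m for m in pool
        if intent in m.capabilities
        and m.healthy
        and (allowed_jurisdictions is None or m.jurisdiction in allowed_jurisdictions)
    ]
    return sorted(eligible, key=lambda m: m.cost_per_1k_tokens)


def route(intent, call, allowed_jurisdictions=None, pool=POOL):
    """Try each candidate in cost order; move to the next one on a provider error."""
    last_error = None
    for model in candidates(intent, allowed_jurisdictions, pool):
        try:
            return model.name, call(model)
        except ConnectionError as exc:  # stand-in for a provider incident
            last_error = exc
    raise RuntimeError(f"no model available for intent {intent!r}") from last_error
```

In this sketch the caller passes only an intent tag and a `call` function; which model answers is decided by the pool data, so changing the pool changes routing without touching the caller.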
Use cases
90% of customer-service summaries handled by a TR-tuned model hosted in-country. Frontier providers used only for adversarial or English-mixed cases.
90% local, 10% frontier
Code generation and review go to a code-specialized model; design-doc summarization to a general model. Tail latency drops; quality goes up.
Routing rule reads the request's data classification. Anything tagged PII routes only to providers in approved jurisdictions; everything else has the full pool.
Frontier provider returns 5xx for 14 minutes. The gateway rolls to the secondary; the application keeps serving without an incident page.
0 user-facing impact
Shadow traffic confirms equivalent quality at 60% lower cost. Cutover happens with one policy change; rollback would have been just as easy.
Compliance-approved model list maintained centrally. Routing never picks an uncertified model; auditors stop asking 'which model answered'.
Sovereign Arabic LLM gets first call; frontier providers as fallback. Cost falls; sovereignty story tightens.
Local 70B model runs SCADA agent prompts. Hosted providers reserved for non-operational use. The agent never leaves the operator network.
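Trying a new model on a small share of traffic before a cutover can come down to a single percentage held in policy. The sketch below shows one common way to do a percentage-based split, using a stable hash of a request identifier. It is an assumption for illustration, not a description of how the gateway splits traffic, and it covers a live split only, not shadow mirroring. The function and parameter names are hypothetical.

```python
import hashlib


def bucket(request_id: str) -> int:
    """Map a request id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_model(request_id, default_model, candidate_model, candidate_percent):
    """Send roughly candidate_percent of requests to the candidate model."""
    if bucket(request_id) < candidate_percent:
        return candidate_model
    return default_model
```

With this shape, moving from a 5% trial to a full cutover means changing `candidate_percent` from 5 to 100, and rolling back means setting it to 0. Because the bucket is derived from the request id, the same request always lands on the same side of the split.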
Recommended products
Capability-based routing, cost-aware tiers, sovereignty rules, A/B and shadow traffic.
Open the AI Gateway page
Per-model, per-intent, per-consumer telemetry to compare options.
Open the Analytics page
Cache layer that backs semantic responses, regardless of which model answered.
Open the Cache page
Provider health and latency probes; severity-aware alarms on degradation.
Open the Monitoring page
Resources
How capability, cost, sovereignty, and A/B rules compose into routing policy.
The lane every AI call lives on — providers, MCP, agents.
Per-provider, per-intent, per-consumer telemetry.
Where the AI lane sits in the topology.
Routing policy ships as code, reviews in Git, applies idempotently.
Where routing composes with MCP governance, A2A registry, firewalls, and cache.
Explore more
Right model per call
A 30-minute walkthrough of capability routing, cost tiers, and sovereignty rules on a Kubernetes cluster of your choice.