# AI Gateway

> Track every token. Cap every budget. Route across 17+ providers behind one OpenAI-compatible endpoint. Apinizer governs your LLM, MCP, and agent traffic on the same runtime that already runs your REST APIs — same audit, same identity, same operators on call.

*AI Gateway — agentic plane*

## Govern every AI request. Tokens, cost, and risk — one gateway.

[Request a demo](https://calendly.com/apinizer/15min) · [Read the docs](https://apinizer.com/developers/docs)

**Highlights**

- **Providers** — 17+ LLMs
- **Standards** — OpenAI · MCP · A2A
- **Modalities** — Chat · Embed · Audio · Image · Video

---

## Capabilities

### Token economics and cost control — first-class

Every prompt, every response, every embedding — counted, attributed, and capped. Set token budgets per project, team, user, or API key, in any time window. The same three-tier permission model (System / Project / Team) that owns REST quotas owns AI spend, so the people who own the workload also own the bill.

- Live token tracking — input, output, cached, and total — per request
- Per-window ceilings: minute, hour, day, month, or contract period
- Per-scope ceilings: user, API key, team, project, or model class
- Hard caps, soft caps with warnings, and burst windows for short spikes
- Cost attribution back to a project or cost center — finance gets a line item, not a mystery
- Auto fall-back to a cheaper model or a cached answer when the budget tips

### Cost-aware multi-LLM routing

Write the application once against an OpenAI-compatible endpoint. The gateway decides which model actually answers — based on cost, latency, model class, or a per-prompt classifier. Drop in a cheaper model for summarisation, send tier-one customers to the frontier model, fall back to a self-hosted model when a provider degrades.

- OpenAI-compatible request and response shape — no client rewrites
- Weighted routing across cost, latency, success rate, or model class
- Per-prompt classifier route — cheap model for the easy 80%, frontier for the hard 20%
- Provider fall-back chains — degrade gracefully when one provider stalls or rate-limits
- Streaming, function calling, tool use, batch, and file upload — preserved across providers
- Pinned model versions and canary releases — promote a cheaper model behind a 5% slice first

### Response cache today — semantic cache coming soon

Skip the token bill on repeat prompts. Apinizer's two-tier Cache fronts LLM responses on an exact-prompt match — a local in-pod tier for sub-millisecond hits, a Hazelcast cluster for cross-pod truth. Same cache the gateway already runs for REST responses, with the same invalidation and the same operator dashboard. Semantic cache, with embedding-similarity matching, is on the roadmap for an upcoming release.

- Two-tier response cache — local in-pod tier plus the Hazelcast cluster shared with the REST gateway
- Exact-match keying on prompt + system message + model + tools — never blends user contexts
- Configurable TTL per route, with stampede protection for hot prompts
- Per-project, per-model, and per-template invalidation — atomic on redeploy
- Live hit-rate, token-spend-saved, and latency-saved reports per cache bucket
- Coming soon: semantic cache with embedding-similarity matching for prompts that mean the same thing

### Prompt engineering with guardrails

Prompt templates, system messages, decorators, and few-shot examples live on the gateway — versioned, reviewed, and promoted across environments like any other artifact. Application code stops carrying the prompt; engineering owns the prompt the way it owns the schema.

- Prompt templates with parameter binding from request, identity, and project context
- System prompts and prompt decorators applied at the gateway, not in the client
- Prompt versioning with rollback, A/B canary, and reviewer sign-off
- Few-shot examples and tool descriptions managed as artifacts
- Per-environment promotion through APIops — dev to test to prod

### Prompt firewall — injection, jailbreak, and data loss

Block the patterns that put regulated AI projects on hold: prompt injection, jailbreak chains, credential and PII exfiltration, off-topic prompts that waste budget, and tool-use abuse. Apinizer runs the guards inline — the bad request never reaches the model.

- Inline prompt guards — injection, jailbreak, role override, system-prompt extraction
- Off-topic and cost guards — refuse essays on company time
- PII detection and redaction on both the prompt and the response
- Credential and secret patterns blocked before they reach the provider
- Tool-use guard — only the tools allowed for the calling identity are exposed
- Custom redaction policies per project — the team owns its own definitions

### AI observability — every prompt, every token, every model

The Analytics Engine ingests AI traffic next to REST traffic. One query answers cost-by-team, latency-by-provider, error-rate-by-model, and which prompt template tripped the firewall last night. Operators see token spend in the same dashboard where they see request rate.

- Token spend by user, project, team, model, and time window
- Latency, time-to-first-token, and throughput per provider and per model
- Error rate, retry rate, and timeout rate per provider
- Tool-use and function-call traces — see the agent's reasoning chain
- Prompt firewall hits with the decision, the rule, and the offending substring
- Real-time anomaly detection on token spend and latency — the bill never surprises you twice

### MCP servers and agent-to-agent governance

Agents talk through the gateway like any other client. Generate Model Context Protocol servers from the APIs you already published, decide which agent can see which tool, and audit every agent-to-agent message — with the same identity surface that fronts your REST and AI traffic.

- Auto-generate MCP servers from existing REST, SOAP, and gRPC APIs
- Per-agent identity provisioned in Identity Manager — scoped tokens, not shared API keys
- Tool-level RBAC — which agent can call which tool, on which project
- Agent-to-agent (A2A) message audit at the persistence layer — same record as REST
- Context Mesh — agents consume API data and event streams through one governed surface
- Per-agent quotas and rate limits — runaway agents cap themselves

### One gateway, one audit, one runtime

AI Gateway is not a side-car. It is a layer of policies on the same gateway that runs your REST, gRPC, WebSocket, SOAP, and GraphQL traffic. Same identity, same audit, same observability, same operators. There is no second control plane to learn, no second pager rotation, no second invoice.

- Same gateway runtime — REST, gRPC, WebSocket, SOAP, GraphQL, and AI on one process
- Same identity surface — OAuth 2.0, OIDC, JWT, mTLS, SAML for humans, agents, and partners
- Same audit at the persistence layer — bypass rejected at compile time
- Same three-tier permission model — System, Project, Team — across API and AI
- Same hot-deploy path — change a prompt or a route without restarting a pod
- Same Kubernetes posture — self-hosted, air-gap friendly, no data leaves the cluster

---

## Use cases

### Stop the AI bill from running away

Token budgets per project. Response cache on the hot path. Cheap-model fall-back when the budget tips. Cost attribution by team and project. The AI line item stops being a surprise, and finance gets a chargeback report instead of a Slack message.

- Per-project, per-team, and per-user token ceilings
- Response cache with hit-rate and savings reports (semantic cache on the roadmap)
- Cheaper-provider fall-back inside a latency target
- Cost attribution back to the project that ran the workload
- Monthly chargeback exports for finance

### Production-grade agents under one governance plane

Agents authenticate like users. Agent-to-agent messages are audited at the persistence layer. Tool access is scoped per identity. Runaway loops cap themselves. The platform team owns one control plane for human and agent traffic — not two.

- Agent identities provisioned in Identity Manager
- Per-agent quotas, rate limits, and tool RBAC
- Agent-to-agent message audit and replay
- Context Mesh for cross-system data access
- MCP servers auto-generated from the API catalog

### PII, secrets, and prompt injection — handled at the edge

Run regulated AI projects without a binder of risk acceptances. Inline prompt firewall, PII redaction, and credential blocking on both directions. Audit trail on every prompt and every response. Same auditor view as the REST estate — one report, two surfaces.

- Inline prompt firewall — injection, jailbreak, role override
- PII and credential redaction on prompts and responses
- Per-project redaction policies — teams own their definitions
- Audit trail at the persistence layer — every prompt, response, and decision
- Air-gap-friendly deployment — your data never leaves your cluster

### One endpoint, many providers — no client rewrites

Application code stays on the OpenAI-compatible surface. The gateway decides who answers — frontier, open-weights, or a self-hosted model — based on cost, latency, or the prompt itself. Promote a cheaper model behind a canary slice before flipping the full route.

- OpenAI-compatible request and response shape
- Weighted routing across cost, latency, success rate
- Per-prompt classification — cheap for easy, frontier for hard
- Provider fall-back chains for degraded providers
- Canary promotion behind a 5% slice before flipping the route

### Chat, embedding, audio, image, and video — one path

Same identity, same audit, same observability for every modality. Embeddings get the same redaction as chat. Image generation gets the same quota window as text. Audio transcripts land in the same Analytics Engine as request logs.

- Chat completions — streaming, batch, function calling
- Embeddings — single, batch, cached
- Audio — transcription, text-to-speech, translation
- Image — generation, edit, variation
- Video — generation across supported providers

---

## What ships in the box

### AI traffic types

- Chat completions — streaming and batch
- Embeddings — single and batch
- Audio — transcription, text-to-speech, translation
- Image — generation, edit, variation
- Video — generation across supported providers
- Function calling and tool use
- Agent-to-Agent (A2A) messages
- Model Context Protocol (MCP) interactions

### Cost & token governance

- Live token tracking — input, output, cached, total
- Ceilings per window — minute, hour, day, month, custom
- Ceilings per scope — user, API key, team, project, model
- Soft warnings, hard caps, burst windows
- Auto fall-back to cached answer or cheaper model on budget tip
- Chargeback exports back to the project that ran the workload

### Security & guardrails

- Prompt injection and jailbreak guards
- PII detection and redaction on prompts and responses
- Credential and secret pattern blocking
- Off-topic and oversize prompt guards
- Tool-use RBAC per identity
- Per-project redaction policies
- Audit trail at the persistence layer

### Operability

- Same gateway runtime as REST, gRPC, and WebSocket
- Same identity, same audit, same RBAC across API and AI
- Hot deploy for prompts, routes, and model selection
- Three-tier permission model (System / Project / Team)
- Live cost, latency, and reliability dashboards
- Kubernetes-native, air-gap-friendly deployment

---

## Resources

- [AI Gateway docs](https://apinizer.com/developers/docs) — Configure providers, set token budgets, write prompt firewall policies, and observe AI traffic alongside REST.
- [Cost & token playbook](https://apinizer.com/developers/docs/ai-gateway/cost-control) — Patterns for project budgets, chargeback, response cache tuning, and cheap-model fall-back chains.
- [Provider quickstarts](https://apinizer.com/developers/docs/ai-gateway/providers) — Drop-in recipes for OpenAI, Anthropic, Bedrock, Azure OpenAI, Gemini, and self-hosted Llama or vLLM.
- [Prompt firewall reference](https://apinizer.com/developers/docs/ai-gateway/firewall) — The guard catalog — injection, jailbreak, PII, credentials, off-topic, tool-use — with policy snippets.
- [AI observability guide](https://apinizer.com/developers/docs/ai-gateway/observability) — Cost, latency, reliability, and firewall dashboards in the Analytics Engine — one query for API and AI.
- [Architecture overview](https://apinizer.com/products) — How the AI plane shares one runtime with the API Gateway, Identity Manager, Cache, and Analytics Engine.
- [Migration from a side-car gateway](https://apinizer.com/developers/docs/ai-gateway/migration) — A short field guide for teams running a dedicated AI gateway today — what to keep, what to retire, what to consolidate.

---

## Next step

*Govern every AI request*

**Bring tokens, agents, and risk under one control plane.**

A 30-minute walkthrough of the Apinizer AI Gateway — token budgets, multi-LLM routing, response cache, prompt firewall, MCP, and AI observability — on a Kubernetes of your choice.

[Book a Demo](https://calendly.com/apinizer/15min) · [Read the docs](https://apinizer.com/developers/docs)

---

## Links

- Products: https://apinizer.com/products
- AI Gateway: https://apinizer.com/products/ai-gateway
- Solutions: https://apinizer.com/solutions
- Pricing: https://apinizer.com/pricing
- Developers: https://apinizer.com/developers
- Documentation: https://docs.apinizer.com/index-en
- Blog: https://apinizer.com/blog
- Contact: https://apinizer.com/company/contact

© 2026 Apinizer. All rights reserved.
