Jailbreak detection
Apinizer's AI Gateway applies policy on the way in and on the way out — jailbreak detection, PII redaction, injection scoring, regex denylists, and policy-driven blocks — without changing the application.
The problem
Every model has a jailbreak; every prompt can carry PII; every completion can leak. The fix isn't waiting for a perfect model — it's putting a firewall in front of every model and a filter in front of every completion. Apinizer's AI Gateway does both: detect, redact, score, block, audit. Same plane as the API.
Capabilities
Pattern + classifier blends score every inbound prompt. Suspected jailbreaks block, get redirected to a hardened model, or trigger an alarm — your policy choice.
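As a rough illustration of how a pattern-plus-classifier blend could drive a policy decision, here is a minimal Python sketch. It is not Apinizer's implementation: the patterns, the cue-word "classifier", the weights, the threshold, and the `Policy` fields are all hypothetical stand-ins chosen for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns; a real rule set would be larger and tuned per deployment.
JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"pretend (you are|to be) .* without (any )?restrictions", re.IGNORECASE),
]

def pattern_score(prompt: str) -> float:
    """Return 1.0 if any known pattern matches, else 0.0."""
    return 1.0 if any(p.search(prompt) for p in JAILBREAK_PATTERNS) else 0.0

def classifier_score(prompt: str) -> float:
    """Stand-in for a trained classifier: fraction of cue words present."""
    cues = ("jailbreak", "bypass", "unfiltered", "no rules")
    lowered = prompt.lower()
    return sum(cue in lowered for cue in cues) / len(cues)

@dataclass
class Policy:
    # All values below are assumptions for illustration.
    pattern_weight: float = 0.6
    classifier_weight: float = 0.4
    threshold: float = 0.5
    action: str = "block"  # "block", "reroute", or "alarm"

def evaluate(prompt: str, policy: Policy) -> tuple[float, str]:
    """Blend both scores and return (score, decision)."""
    score = (policy.pattern_weight * pattern_score(prompt)
             + policy.classifier_weight * classifier_score(prompt))
    decision = policy.action if score >= policy.threshold else "allow"
    return round(score, 3), decision
```

With these assumed weights, a prompt that matches a pattern and one cue word scores above the threshold and takes whatever action the policy names; a benign prompt scores zero and is allowed through.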
Names, IDs, account numbers, addresses — redacted before the prompt reaches the model and before the completion reaches the user. Configurable per locale.
RAG-injection patterns and tool-poisoning attempts scored on the way in. Suspicious context blocked before it reaches the agent.
Block secrets, source code, internal hostnames, or anything else you don't want leaving the model. Filter applies before the response reaches the consumer.
Firewall rules ship as data, not code. Review in Git, apply via APIops, propagate to every Worker in seconds.
Every block and redaction captured with reason, score, and policy reference. Auditors and developers see the same explanation.
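To make "rules as data" and "reason, score, and policy reference" concrete, the sketch below loads rules from JSON and emits an audit event when a rule fires. The rule schema, field names, and rule IDs are assumptions made for this example and do not reflect Apinizer's actual formats.

```python
import json
from datetime import datetime, timezone

# Hypothetical rule file content; in practice this would live in Git.
RULES_JSON = """
[
  {"id": "jb-001", "kind": "jailbreak", "threshold": 0.5, "action": "block"},
  {"id": "inj-002", "kind": "injection", "threshold": 0.7, "action": "alarm"}
]
"""

def load_rules(text):
    """Parse the rule list and index it by rule id."""
    return {rule["id"]: rule for rule in json.loads(text)}

def apply_rule(rule, score, reason):
    """Return an audit event if the score reaches the rule threshold, else None."""
    if score < rule["threshold"]:
        return None
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "policy_ref": rule["id"],   # which rule fired
        "action": rule["action"],   # what the gateway did
        "score": score,             # how confident the scorer was
        "reason": reason,           # human-readable explanation
    }

rules = load_rules(RULES_JSON)
event = apply_rule(rules["jb-001"], 0.7, "matched 'ignore previous instructions'")
print(json.dumps(event, indent=2))
```

Because the event carries the rule id alongside the score and reason, the same record can serve both an auditor tracing a policy and a developer debugging a blocked request.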
Use cases
Outbound filter detects 16-digit patterns adjacent to keywords. Blocked completions log an explanatory event; the user gets a safe fallback message.
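A keyword-adjacency check of this kind might look roughly like the following Python sketch. The keyword list, the 40-character window, the separator handling, and the fallback text are all assumptions for illustration; a production filter would likely add checks such as a Luhn checksum.

```python
import re

# Assumed keyword list and context window; tune per deployment.
CARD_KEYWORDS = ("card", "visa", "mastercard", "credit", "debit")
WINDOW = 40
FALLBACK = "Sorry, I can't share that information."

# Exactly 16 digits, optionally separated by single spaces or hyphens.
SIXTEEN_DIGITS = re.compile(r"\b(?:\d[ -]?){15}\d\b")

def contains_card_like_number(text):
    """True if a 16-digit run appears within WINDOW characters of a keyword."""
    for match in SIXTEEN_DIGITS.finditer(text):
        start = max(0, match.start() - WINDOW)
        context = text[start:match.end() + WINDOW].lower()
        if any(keyword in context for keyword in CARD_KEYWORDS):
            return True
    return False

def filter_completion(text):
    """Return (text_for_user, audit_event_or_None)."""
    if contains_card_like_number(text):
        event = {"action": "block",
                 "reason": "16-digit pattern near payment keyword"}
        return FALLBACK, event
    return text, None
```

Requiring a nearby keyword is what keeps 16-digit order or tracking numbers from being blocked along with card numbers.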
Inbound redaction strips names, IDs, dates. The model summarizes; the completion is re-keyed back to the patient on the gateway side, never inside the model.
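The redact-then-re-key flow can be sketched in Python as below. This is a simplified illustration under stated assumptions: the sensitive values are passed in explicitly (a real gateway would detect them), the `[[ENTITY_n]]` token format is invented for the example, and `fake_model` stands in for the actual model call.

```python
def redact(text, sensitive_values):
    """Replace each sensitive value with a placeholder token.

    Returns the redacted text and a token -> value mapping that
    stays on the gateway side and is never sent to the model.
    """
    mapping = {}
    for index, value in enumerate(sensitive_values, start=1):
        token = f"[[ENTITY_{index}]]"
        if value in text:
            text = text.replace(value, token)
            mapping[token] = value
    return text, mapping

def rekey(completion, mapping):
    """Restore the original values in the model's completion."""
    for token, value in mapping.items():
        completion = completion.replace(token, value)
    return completion

def fake_model(prompt):
    """Hypothetical model stub: echoes the prompt as a 'summary'."""
    return "Summary: " + prompt

note = "Jane Doe (ID 12345) was admitted on 2024-03-01."
redacted, mapping = redact(note, ["Jane Doe", "12345", "2024-03-01"])
completion = fake_model(redacted)      # the model only ever sees tokens
restored = rekey(completion, mapping)  # re-keyed on the gateway side
```

The key property is that the mapping never leaves the gateway: the model works on tokens, and the original values are reattached only after the completion comes back.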
0 PHI to model
1.8% of citizen-chatbot prompts flagged jailbreak-suspicious. Half rerouted to a hardened model with a stricter system prompt; half blocked outright.
Documents uploaded by customers occasionally carry 'ignore previous instructions' patterns. Scorer blocks the prompt; SOC reviews the document offline.
Outbound filter blocks any response containing API keys or repo paths. Editorial productivity unchanged; risk posture significantly improved.
Each jurisdiction's national identifiers — tax IDs, citizen numbers, social-security formats — redacted with the right pattern in the right locale. Same firewall, different rules per region.
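One way to picture "same firewall, different rules per region" is a pattern table keyed by locale, as in the Python sketch below. The table layout and labels are assumptions for this example, and the two patterns are deliberately simplified shapes (a US Social Security number written as 3-2-4 digits, and an 11-digit Turkish identity number with a non-zero first digit) without the checksum or range validation a real rule would add.

```python
import re

# Hypothetical per-locale rule table: locale -> list of (label, pattern).
LOCALE_PATTERNS = {
    "en-US": [("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"))],
    "tr-TR": [("TCKN", re.compile(r"\b[1-9]\d{10}\b"))],
}

def redact_for_locale(text, locale):
    """Apply only the patterns registered for the given locale."""
    for label, pattern in LOCALE_PATTERNS.get(locale, []):
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_for_locale("SSN 123-45-6789 on file", "en-US"))
print(redact_for_locale("Kimlik no 12345678901", "tr-TR"))
```

Keeping the patterns in a table rather than in code paths means adding a jurisdiction is a data change, which fits the rules-as-data approach described above.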
An adversarial document tried to coerce an operations agent into changing SCADA parameters. Injection scorer caught it; the agent never saw it.
Per-locale PII patterns, jailbreak rules, and outbound denylist tuned for the local language. The compliance officer signs the audit pack without changes.
Recommended products
Prompt firewall built in — jailbreak detection, PII redaction, injection scoring, outbound filters.
Open the AI Gateway page
Blocks and redactions appear in the same telemetry as cost and latency.
Open the Analytics page
Severity-aware alarms when block rate spikes or new jailbreak patterns appear.
Open the Monitoring page
Tie firewall outcomes to consumer identity: repeat offenders revoked at the auth layer.
Open the Identity page
Resources
Inbound and outbound rules, scoring blends, policy-as-data — how the firewall composes.
The lane the firewall runs on — alongside routing, caching, and audit.
Anomaly detection on block rates and emerging patterns.
Firewall rules ship as data, review in Git, apply idempotently.
How firewall outcomes feed KVKK / GDPR / BDDK evidence.
Where the firewall sits in the AI lane.
Safety as a runtime property
A 30-minute walkthrough of jailbreak, PII, injection, and outbound filters on a Kubernetes cluster of your choice.