
Prompt firewalls — runtime safety for LLM traffic

Jailbreak detection, PII redaction, injection scoring, outbound content filtering. What it takes to make LLM safety a runtime property instead of a model property.

Nov 4, 2025 · 9 min read · Selin Demir, VP Product · AI

Tags: #ai-gateway · #prompt-firewall · #safety · #compliance


LLM safety is not a property of the model. It is a property of the runtime in front of the model.

Every frontier model has a jailbreak. Every prompt can carry PII. Every completion can leak something it shouldn't. Waiting for a perfect model is not a strategy. Putting a firewall in front of the model — and a filter in front of every completion — is.

This post walks through what an AI prompt firewall actually does, what it doesn't, and how it fits into the rest of the platform.

Why the gateway, not the application

The easy answer to "we need prompt safety" is "let's add safety checks to every application." We've watched a few teams try this. It doesn't end well.

Every application reinvents the same checks. The bank chatbot, the underwriting agent, and the document summarizer all need PII redaction. Three teams, three implementations, three slightly different blocklists.

The checks drift. When the threat model changes (new injection pattern, new jailbreak vector), three teams have to update three implementations. One of them won't.

The audit trail fragments. Three implementations means three log shapes. The compliance team can't answer "how many redactions across the platform last quarter" without joining three data sets.

The risk migrates to the weakest implementation. Attackers find the application with the weakest checks, and the whole platform inherits that weakest-link risk.

A prompt firewall at the gateway fixes all four. One implementation. One policy. One audit shape. One place to update when the threat model shifts.

What a prompt firewall actually does

Apinizer's AI Gateway runs the same five-phase pipeline as the REST gateway. Prompt firewall policies sit in pre-flow and post-flow. Pre-flow runs against the inbound prompt; post-flow runs against the outbound completion.

[ pre-flow: prompt firewall ] → [ routing: pick LLM ] → [ upstream call ] → [ post-flow: completion firewall ] → response

Specifically, four classes of checks:

1. Jailbreak detection

Detection blends three signals: regex patterns, semantic similarity to known jailbreak corpora, and a small classifier trained on labeled jailbreak attempts. Each inbound prompt gets a score.

Three actions on a high score:

  • Block. The prompt never reaches the upstream model. Consumer receives a defined error response.
  • Redirect. The prompt routes to a hardened model with a stricter system prompt. Useful when the platform has both a primary and a hardened LLM available.
  • Alarm. The prompt passes through, but the security team gets notified. Useful in exploratory phases before tuning the action thresholds.

The action per score band is policy-configurable. Different lines of business pick different thresholds for the same firewall.
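A minimal sketch of how a blended score could map to a per-band action. The regex patterns, the weights, and the band boundaries are illustrative assumptions, not Apinizer's actual values; the similarity and classifier scores are assumed to arrive from elsewhere as numbers in [0, 1].

```python
import re
from dataclasses import dataclass

# Hypothetical regex signals; a real deployment would keep a larger, reviewed set.
JAILBREAK_PATTERNS = [
    re.compile(r"\bignore (all )?(your|the) (safety|content) (rules|policies)\b", re.I),
    re.compile(r"\bpretend (you are|to be) .* without (any )?restrictions\b", re.I),
]


@dataclass(frozen=True)
class Band:
    min_score: float  # inclusive lower bound of the band
    action: str       # "alarm", "redirect", or "block"


# Illustrative bands; each line of business would configure its own.
BANDS = [Band(0.9, "block"), Band(0.7, "redirect"), Band(0.5, "alarm")]


def regex_score(prompt: str) -> float:
    """Return 1.0 if any known pattern matches, else 0.0."""
    return 1.0 if any(p.search(prompt) for p in JAILBREAK_PATTERNS) else 0.0


def blended_score(prompt, similarity, classifier, weights=(0.3, 0.3, 0.4)):
    """Weighted blend of the regex, similarity, and classifier signals."""
    w_regex, w_sim, w_clf = weights
    return w_regex * regex_score(prompt) + w_sim * similarity + w_clf * classifier


def action_for(score, bands=BANDS):
    """Pick the action of the highest band the score reaches, else allow."""
    for band in sorted(bands, key=lambda b: b.min_score, reverse=True):
        if score >= band.min_score:
            return band.action
    return "allow"
```

Keeping the bands as plain data is what lets two lines of business share the same scoring code while choosing different thresholds.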

2. PII redaction

Names, government IDs (TCKN, SSN, fiscal codes, NHS numbers), account numbers (IBAN and similar), addresses, dates of birth, email addresses.

Two flavors of action:

  • Strip. Replace with a placeholder before the prompt leaves the cluster. The model never sees the raw value.
  • Redact-and-rekey. Strip on the way in; restore on the way back. The model summarizes against placeholders; the gateway substitutes the real values back into the response just before delivery. The model never holds the PII; the consumer sees the right output.

The redact-and-rekey pattern is the one most regulated customers want. The model gets to do its job; the PII never crosses the boundary.
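The redact-and-rekey round trip can be sketched as below. This is a simplified illustration, not the gateway's implementation: the two regexes are deliberately loose assumptions (production detectors would validate checksums and cover many more identifier types), and the placeholder format `<KIND_N>` is hypothetical.

```python
import re

# Simplified, hypothetical patterns for illustration only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def redact(prompt):
    """Replace PII with placeholders; return (redacted_text, vault)."""
    vault = {}     # placeholder -> original value, kept inside the gateway
    counters = {}  # per-kind counter used to number placeholders

    def make_sub(kind):
        def _sub(match):
            value = match.group(0)
            # Reuse the placeholder when the same value appears again.
            for placeholder, original in vault.items():
                if original == value:
                    return placeholder
            counters[kind] = counters.get(kind, 0) + 1
            placeholder = f"<{kind}_{counters[kind]}>"
            vault[placeholder] = value
            return placeholder
        return _sub

    redacted = prompt
    for kind, pattern in PII_PATTERNS.items():
        redacted = pattern.sub(make_sub(kind), redacted)
    return redacted, vault


def rekey(completion, vault):
    """Restore original values in the completion just before delivery."""
    for placeholder, original in vault.items():
        completion = completion.replace(placeholder, original)
    return completion
```

The vault lives only for the duration of the request, so the upstream model sees placeholders while the caller sees the real values.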

3. Injection scoring

Inbound prompts get scored for prompt-injection patterns ("ignore previous instructions"), RAG-injection patterns (documents trying to hijack the agent), and tool-poisoning patterns (instructions telling the agent to call dangerous tools).

This is different from jailbreak detection — jailbreak attempts try to get the model to break its safety constraints. Injection attempts try to get the model to do something the consumer didn't ask for, sometimes via a document or a retrieved context.

High injection scores can block the prompt outright, or block just the suspected portion (a retrieved document) while letting the rest of the prompt through.
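A rough sketch of segment-level blocking, assuming the prompt has already been split into segments tagged by source. The patterns and the toy scoring function are assumptions made for illustration; a real scorer would likely combine patterns with a trained model.

```python
import re
from dataclasses import dataclass

# Hypothetical signals covering the pattern families described above.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"disregard the (system|above) prompt", re.I),
    re.compile(r"call the \w+ tool", re.I),  # crude tool-poisoning signal
]


@dataclass
class Segment:
    source: str  # e.g. "user" or "retrieved"
    text: str


def injection_score(text):
    """Toy score: 0.0 with no hits, 0.7 with one hit, capped at 1.0."""
    hits = sum(1 for p in INJECTION_PATTERNS if p.search(text))
    return min(1.0, 0.7 * hits)


def apply_block_segment(segments, threshold=0.6):
    """Drop only the segments scoring at or above the threshold."""
    kept, dropped = [], []
    for seg in segments:
        target = dropped if injection_score(seg.text) >= threshold else kept
        target.append(seg)
    return kept, dropped
```

With this shape, a poisoned retrieved document lands in `dropped` while the user's own question still reaches the model.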

4. Outbound content filters

Completions get scanned before delivery. Block secrets (API keys, JWTs, internal hostnames), block source code if the policy disallows it, block PII that crept into the response, block anything else policy says shouldn't leave the model.

The outbound filter is the one that catches "the model leaked something." It also catches the case where the model hallucinates a real-looking secret. Both get blocked the same way.
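An outbound secret scan might look roughly like this. The JWT and AWS access key ID shapes are well-known public formats; the internal hostname suffix is a made-up placeholder, and the function names are hypothetical. Because matching is by shape, a hallucinated secret and a leaked one are treated identically.

```python
import re

SECRET_PATTERNS = {
    # Three base64url segments; JWT headers start with "eyJ".
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    # AWS access key IDs: "AKIA" followed by 16 uppercase alphanumerics.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Hypothetical internal DNS suffix, for illustration only.
    "internal_hostname": re.compile(r"\b[a-z0-9-]+\.corp\.internal\b"),
}


def scan_completion(completion):
    """Return the sorted names of every secret pattern found."""
    return sorted(
        name for name, pattern in SECRET_PATTERNS.items()
        if pattern.search(completion)
    )


def filter_completion(completion):
    """Block delivery if anything secret-shaped is present."""
    findings = scan_completion(completion)
    if findings:
        return {"delivered": False, "findings": findings}
    return {"delivered": True, "body": completion}
```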

What "policy as data" means in practice

Firewall rules ship as data, not code. They live in a manifest, get reviewed in Git, get applied via APIops, and propagate to every Worker within seconds via hot deployment.

# prompt-firewall.yaml — APIops manifest
ai_firewall:
  - name: customer-service-strict
    inbound:
      jailbreak:
        threshold: 0.7
        action: block
      pii:
        patterns: [tckn, iban, email, phone]
        action: strip
      injection:
        threshold: 0.6
        action: block-segment
    outbound:
      secrets:
        action: block
      pii:
        patterns: [tckn, iban, email]
        action: strip

A new injection pattern shows up in the wild on a Tuesday. The security team updates the manifest, opens a PR, gets a review, and applies. Every Worker enforces the new rule within seconds. No deploy window. No incident bridge.

This is the same APIops manifest pattern the REST gateway uses for proxy policies — same toolchain, same audit trail, same review flow. The AI plane isn't a separate platform; it's the same platform with an additional manifest type.
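Because rules are data, they can be linted before merge. The sketch below shows one possible pre-merge check over an already-parsed policy entry (a plain dict mirroring the manifest above). The set of allowed actions is an assumption pieced together from the actions named in this post, not a published schema.

```python
# Assumed action vocabulary, collected from the actions this post mentions.
ALLOWED_ACTIONS = {
    "block", "block-segment", "redirect", "alarm", "strip", "redact-and-rekey",
}


def validate_policy(policy):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    for direction in ("inbound", "outbound"):
        for check, rule in policy.get(direction, {}).items():
            path = f"{policy.get('name', '?')}.{direction}.{check}"
            action = rule.get("action")
            if action not in ALLOWED_ACTIONS:
                errors.append(f"{path}: unknown action {action!r}")
            threshold = rule.get("threshold")
            if threshold is not None and not 0.0 <= threshold <= 1.0:
                errors.append(f"{path}: threshold {threshold} outside [0, 1]")
    return errors


# Parsed form of the customer-service-strict entry shown above.
policy = {
    "name": "customer-service-strict",
    "inbound": {
        "jailbreak": {"threshold": 0.7, "action": "block"},
        "pii": {"patterns": ["tckn", "iban", "email", "phone"], "action": "strip"},
        "injection": {"threshold": 0.6, "action": "block-segment"},
    },
    "outbound": {
        "secrets": {"action": "block"},
        "pii": {"patterns": ["tckn", "iban", "email"], "action": "strip"},
    },
}
```

Running a check like this in the PR pipeline catches a mistyped action or an out-of-range threshold before any Worker sees it.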

How explainability works

Every block, redirect, redact, and alarm gets an event in the audit trail with:

  • Timestamp and consumer identity
  • The action taken and the reason
  • The score(s) that triggered the action
  • The policy reference that was applied
  • A redacted snippet of the offending portion (without PII)

When the auditor asks "why was this prompt blocked," there's one query. When the developer asks "why is my legitimate request being blocked," there's the same query — they get the policy reference, they look at the score, and they can adjust the threshold or whitelist their use case in the manifest.

Same explainability shape for the security team and the developer. Same data source. Same trust.
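One way to picture the audit event is as a single record type carrying the fields listed above. The field names and the `explain` helper are hypothetical; the actual audit schema is not specified in this post.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class FirewallEvent:
    timestamp: str   # ISO 8601, UTC
    consumer: str    # consumer identity
    action: str      # block, redirect, redact, or alarm
    reason: str      # why the action was taken
    scores: dict     # score(s) that triggered the action
    policy_ref: str  # policy that was applied
    snippet: str     # offending portion, already stripped of PII


def make_event(consumer, action, reason, scores, policy_ref, snippet, now=None):
    """Build an event; `now` is injectable so tests can pin the timestamp."""
    now = now or datetime.now(timezone.utc)
    return FirewallEvent(
        now.isoformat(), consumer, action, reason, dict(scores), policy_ref, snippet
    )


def explain(events, consumer):
    """The 'one query': every event for a consumer, as JSON lines."""
    return [
        json.dumps(asdict(e), sort_keys=True)
        for e in events
        if e.consumer == consumer
    ]
```

The auditor and the developer would both read the same records; only the question they bring to them differs.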

What blocks per industry actually look like

We've watched the rollout shape repeat across customer segments.

Banking. Outbound filters on completions catch account-number-shaped patterns in chatbot responses. Inbound PII redaction strips IBAN, TCKN, and card numbers before any model call. Jailbreak threshold set low so that borderline prompts are blocked: false positives are acceptable; false negatives aren't.

Healthcare. Inbound PHI redaction before the model sees the prompt. Outbound filter blocks any response containing patient identifiers. Redact-and-rekey for clinical-note summarization — the model summarizes against placeholders; the gateway restores the patient context only on the way back to the authorized caller.

Public sector. Jailbreak scoring in real time. Per-locale PII patterns — national IDs, tax IDs, citizen numbers, social-security formats — redacted per locale. Same firewall, different rules per jurisdiction.

Insurance. Inbound injection scoring on documents customers upload. Adversarial documents carrying "ignore previous instructions" get blocked before the prompt ever leaves the cluster.

Media. Outbound filters catch source-code-shaped strings and API keys. Editorial productivity unchanged; risk profile materially better.

Telecom. Locale-specific PII rules across the regions the carrier operates in. Each jurisdiction's national identifiers redacted with the right pattern in the right locale. One firewall, many regional configurations.

Energy. Injection scoring on operations agents — adversarial documents trying to coerce SCADA-adjacent agents into changing parameters. Caught before reaching the agent.

What this doesn't fix

Three things prompt firewalls don't do, and shouldn't pretend to:

Make a hallucination disappear. The firewall blocks dangerous outputs. It doesn't make the model accurate. Hallucination mitigation is a different problem (RAG quality, evaluation, model selection) — the firewall is the runtime safety net, not the truth oracle.

Replace authentication or authorization. A prompt firewall doesn't know whether the consumer is allowed to ask this question. That's the identity and authorization layer's job. The firewall blocks dangerous prompts, not unauthorized ones.

Stop a determined adversary 100% of the time. No firewall is perfect. The goal is materially raising the cost of attack and catching the obvious patterns. Combine the firewall with monitoring, anomaly detection, and human review for high-risk applications.

How it composes with the rest of the AI plane

Prompt firewalls don't live in isolation. They compose with the rest of the AI plane:

  • Routing. Firewall outcomes can drive routing decisions — a flagged-suspicious prompt routes to a hardened model with a stricter system prompt.
  • Semantic cache. A prompt that hits a cached completion never reaches the model. The firewall still runs on the cached response before delivery — the cache doesn't bypass outbound filters.
  • Audit and analytics. Block rates, redaction rates, and score distributions show up on the same Analytics Engine the REST gateway uses. One observability surface for everything.
  • Monitoring. Alarms fire when block rate spikes, when a new injection pattern hits an unusual frequency, or when a particular consumer starts generating unusual scores. The security team notices the threat shift before the threat scales.
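The composition above can be sketched as a single request handler. Everything here is an assumption for illustration: the checks and the model call are injected as plain callables, and an exact-match dict stands in for the semantic cache, which in practice would match on similarity rather than equality.

```python
def handle(prompt, *, inbound_check, cache, call_model, outbound_check):
    """Pre-flow check, cache or model call, then post-flow check."""
    verdict = inbound_check(prompt)  # "allow", "redirect", or "block"
    if verdict == "block":
        return {"status": "blocked", "stage": "pre-flow"}

    completion = cache.get(prompt)
    if completion is None:
        # A flagged-suspicious prompt goes to the hardened model.
        completion = call_model(prompt, hardened=(verdict == "redirect"))
        cache[prompt] = completion

    # The outbound filter runs on cached and fresh completions alike.
    if not outbound_check(completion):
        return {"status": "blocked", "stage": "post-flow"}
    return {"status": "ok", "body": completion}
```

The point of the ordering is that a cache hit skips only the upstream call, never the post-flow check.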

What we measure

Four numbers worth watching during a prompt-firewall rollout:

  • Block rate per consumer, per model. Anomalies surface adversarial behavior or misconfigured legitimate use cases.
  • Score distribution. Healthy distributions cluster low with a thin tail. A bimodal distribution suggests two distinct populations, legitimate and adversarial, and the threshold belongs in the gap between the two modes.
  • False-positive rate (sampled). Auditing a sample of blocks against ground truth keeps the threshold honest.
  • Redaction completeness. For PII redaction, sampled checks against ground truth keep the pattern coverage honest as new locales or new identifier shapes appear.

Each of these surfaces on the Analytics Engine.
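For readers who want to reproduce two of these numbers from raw events, here is a small sketch. The event and review record shapes (`consumer`, `model`, `action`, `was_malicious`) are assumed for illustration and do not describe the Analytics Engine's schema.

```python
from collections import Counter


def block_rates(events):
    """Block rate per (consumer, model) pair."""
    totals, blocks = Counter(), Counter()
    for e in events:
        key = (e["consumer"], e["model"])
        totals[key] += 1
        if e["action"] == "block":
            blocks[key] += 1
    return {key: blocks[key] / totals[key] for key in totals}


def sampled_false_positive_rate(reviewed_blocks):
    """Share of human-reviewed blocks that turned out to be legitimate.

    Returns None when the sample is empty, since no rate can be estimated.
    """
    if not reviewed_blocks:
        return None
    false_positives = sum(1 for r in reviewed_blocks if not r["was_malicious"])
    return false_positives / len(reviewed_blocks)
```

A sudden jump in one pair's block rate is the kind of anomaly the first metric is meant to surface; the second keeps the threshold honest over time.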

Safety as a runtime property

This is the framing that matters. Model vendors will keep improving their internal safety. They should. But waiting for that to be enough is not the operating posture of a regulated platform.

Safety has to live at the runtime — in front of the model, between the model and the consumer, in the same policy plane as identity, audit, and rate-limit. When safety is a runtime property, you can update it without retraining a model. You can audit it without parsing weights. You can prove it without arguing about training data.

That's what makes prompt firewalls a load-bearing part of the AI gateway, not a feature on a checklist.

If you want a 30-minute walkthrough of inbound and outbound firewall policies on a Kubernetes distribution of your choice, the team is one call away.



© 2026 Apinizer. All rights reserved.