Do not leak secrets
You validated the request and you neutralized injected directives — two guardrails
on the way in. But the agent can still betray you on the way out. A support
agent has a live API key in its environment and the company email in its context,
and a user asks "how do I authenticate?" The model, trying to be maximally helpful,
writes: "Here is your access: key sk-LIVE-7f3aQ29x and reach us at
support@acme.io." Nothing was hacked. The model just said the secret — and once
those characters reach the user, the key is burned and must be rotated. The injection
lesson kept the secret from being extracted; this one keeps it from being
volunteered.
An output guardrail is the scan you run on the draft after the model writes it
and before anything ships. The key idea: the model proposes the words, but you get
the last edit. You don't ask the model to be careful — you don't trust it to be — you
mechanically scan its output for patterns that must never leave the loop and redact
them. A leaked secret is a trust failure, not just a bug: a single exposed
sk-LIVE key in a chat transcript is a credential anyone who sees it can use.
The mechanism is pattern-match-and-replace. You have a regex for an API key
(sk- followed by letters and digits) and one for an email. Run the draft through
both, swapping each hit for [REDACTED] and counting as you go. Walk the example:
"...key sk-LIVE-7f3aQ29x and reach us at support@acme.io." becomes "...key [REDACTED] and reach us at [REDACTED]." — two matches removed, so the guard reports
redacted: 2. That count isn't decoration; it's your signal that the guardrail
actually fired and how often, which is exactly what you'd log and alert on in
production.
Finish guard: replace every API_KEY and EMAIL match with [REDACTED], count
how many you pulled out, and return the cleaned text so it prints with redacted: 2.
The model proposes the words; your guardrail decides what's safe to send. Never trust a draft to be clean on its own.