Filter the Output

Do not leak secrets

You validated the request and you neutralized injected directives — two guardrails on the way in. But the agent can still betray you on the way out. A support agent has a live API key in its environment and the company email in its context, and a user asks "how do I authenticate?" The model, trying to be maximally helpful, writes: "Here is your access: key sk-LIVE-7f3aQ29x and reach us at support@acme.io." Nothing was hacked. The model just said the secret — and once those characters reach the user, the key is burned and must be rotated. The injection lesson kept the secret from being extracted; this one keeps it from being volunteered.

An output guardrail is the scan you run on the draft after the model writes it and before anything ships. The key idea: the model proposes the words, but you get the last edit. You don't ask the model to be careful — you don't trust it to be — you mechanically scan its output for patterns that must never leave the loop and redact them. A leaked secret is a trust failure, not just a bug: a single exposed sk-LIVE key in a chat transcript is a credential anyone who sees it can use.

The mechanism is pattern-match-and-replace. You have a regex for an API key (sk- followed by letters and digits) and one for an email. Run the draft through both, swapping each hit for [REDACTED] and counting as you go. Walk the example: "...key sk-LIVE-7f3aQ29x and reach us at support@acme.io." becomes "...key [REDACTED] and reach us at [REDACTED]." — two matches removed, so the guard reports redacted: 2. That count isn't decoration; it's your signal that the guardrail actually fired and how often, which is exactly what you'd log and alert on in production.

Finish guard: replace every API_KEY and EMAIL match with [REDACTED], count how many you pulled out, and return the cleaned text so it prints with redacted: 2.

The model proposes the words; your guardrail decides what's safe to send. Never trust a draft to be clean on its own.

Do not leak secrets

This is 1 of 50 lessons.