Safety & Deploy · Free preview

Filter the Output

Do not leak secrets

An output guardrail scans what the agent is about to say and redacts sensitive patterns — like API keys and emails — before the text is ever returned.


You validated the request and you neutralized injected directives — two guardrails on the way in. But the agent can still betray you on the way out. A support agent has a live API key in its environment and the company email in its context, and a user asks "how do I authenticate?" The model, trying to be maximally helpful, writes: "Here is your access: key sk-LIVE-7f3aQ29x and reach us at support@acme.io." Nothing was hacked. The model just said the secret — and once those characters reach the user, the key is burned and must be rotated. The injection lesson kept the secret from being extracted; this one keeps it from being volunteered.

An output guardrail is the scan you run on the draft after the model writes it and before anything ships. The key idea: the model proposes the words, but you get the last edit. You don't ask the model to be careful — you don't trust it to be — you mechanically scan its output for patterns that must never leave the loop and redact them. A leaked secret is a trust failure, not just a bug: a single exposed sk-LIVE key in a chat transcript is a credential anyone who sees it can use.
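As a rough sketch of where this scan sits in the request path, the guard can be wired in as the last step before anything is returned. The names `fakeModel`, `redactKeys`, and `respond` are hypothetical and not part of the lesson's code; the model is a canned stand-in so the example runs without any API.

```javascript
// Hypothetical stand-in for a real model call; it returns a canned draft
// that leaks a key, so the example runs without any API.
function fakeModel(userMessage) {
  return "To authenticate, use key sk-LIVE-7f3aQ29x in your request header.";
}

// Hypothetical minimal guard: redacts anything shaped like an sk- key.
function redactKeys(draft) {
  return draft.replace(/sk-[A-Za-z0-9-]+/g, "[REDACTED]");
}

// The model proposes the words; the guard gets the last edit.
// Nothing is returned to the caller without passing through it.
function respond(userMessage, model, guard) {
  const draft = model(userMessage);
  return guard(draft);
}

const reply = respond("how do I authenticate?", fakeModel, redactKeys);
console.log(reply);
// To authenticate, use key [REDACTED] in your request header.
```

Passing the guard in as an argument is only one design choice; the point is that no code path returns the raw draft.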

The mechanism is pattern-match-and-replace. You have a regex for an API key (sk- followed by letters, digits, and hyphens) and one for an email. Run the draft through both, swapping each hit for [REDACTED] and counting as you go. Walk the example: "...key sk-LIVE-7f3aQ29x and reach us at support@acme.io." becomes "...key [REDACTED] and reach us at [REDACTED]." Two matches were removed, so the guard reports redacted: 2. The count isn't decoration; it's your signal that the guardrail actually fired and how often, which is exactly what you'd log and alert on in production.
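One way to get the replacement and the count in a single pass, sketched here for the email pattern only, is to pass a function as the replacement argument to `String.prototype.replace`; the function runs once per match. The helper name `redactPattern` is an assumption for illustration.

```javascript
// Hypothetical helper: redact one pattern and report how many hits it had.
// `pattern` must carry the g flag so replace() visits every match.
function redactPattern(text, pattern) {
  let count = 0;
  const cleaned = text.replace(pattern, () => {
    count += 1; // the replacer is called once per match
    return "[REDACTED]";
  });
  return { cleaned, count };
}

const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const out = redactPattern("reach us at support@acme.io.", EMAIL);
console.log(out.cleaned); // reach us at [REDACTED].
console.log(out.count);   // 1
```

Note that the sentence-ending period survives: the pattern requires at least two letters after the final dot, so the match stops at "io".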

Finish guard: replace every API_KEY and EMAIL match with [REDACTED], count how many you pulled out, and return the cleaned text so it prints with redacted: 2.

The model proposes the words; your guardrail decides what's safe to send. Never trust a draft to be clean on its own.

In the full academy, you write and run this — live, graded:

// An output guardrail. The agent has drafted a reply — but it leaked secrets.
// Scan the draft, REDACT every sensitive pattern, and only THEN return it.
const draft =
  "Here is your access: key sk-LIVE-7f3aQ29x and reach us at support@acme.io.";

// Patterns a deployed agent must never let through:
const API_KEY = /sk-[A-Za-z0-9-]+/g;     // an API key: sk- then letters, digits, or hyphens
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g; // an email

function guard(text) {
  let redacted = 0;
  // TODO: replace each API_KEY and EMAIL match with "[REDACTED]",
  //       counting how many you removed, then return the safe text.
  return { text, redacted };
}

const result = guard(draft);
console.log(result.text);
console.log("redacted:", result.redacted);
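One possible way to finish `guard`, offered as a sketch rather than the graded solution: loop over both patterns and count inside a replacer callback. The `PATTERNS` array and the `safe` variable are additions for illustration; the draft and the two regexes are the ones from the starter code.

```javascript
const draft =
  "Here is your access: key sk-LIVE-7f3aQ29x and reach us at support@acme.io.";

const API_KEY = /sk-[A-Za-z0-9-]+/g;
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PATTERNS = [API_KEY, EMAIL]; // assumption: not in the starter code

function guard(text) {
  let redacted = 0;
  let safe = text;
  for (const pattern of PATTERNS) {
    // The replacer runs once per match, so it doubles as the counter.
    safe = safe.replace(pattern, () => {
      redacted += 1;
      return "[REDACTED]";
    });
  }
  return { text: safe, redacted };
}

const result = guard(draft);
console.log(result.text);
// Here is your access: key [REDACTED] and reach us at [REDACTED].
console.log("redacted:", result.redacted);
// redacted: 2
```

Keeping the patterns in a list means a new sensitive pattern can be added without touching the body of `guard`.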

