Safety & Deploy · Free preview

Refusing Safely

Say no to out-of-policy requests

A deployed agent must sometimes refuse: when a request violates policy, the safe move is to check the rule, decline clearly with the reason, and offer the in-policy path — not to comply because it was asked. The guardrail that does this is a classifier that runs before the agent acts.

The hardest thing to teach an eager agent is to not do something. By default a capable model wants to be helpful, and "helpful" plus an out-of-policy request is how an agent leaks a secret, wires money to an unverified payee, or quietly turns off the logging that would have caught it. So a deployed agent isn't just a model on its own — it's a model wrapped in a guardrail: a check that runs before the agent acts and decides whether this request is even allowed.
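As a minimal sketch of "a check that runs before the agent acts," the wrapper below runs a policy check first and only calls the action if the check allows it. Every name here (guarded, isAllowed, sendWire, the payeeVerified field) is a hypothetical illustration, not part of the lesson's own code.

```javascript
// Hypothetical names throughout: guarded, isAllowed, sendWire are illustrations only.

// Wrap an action so the policy check always runs before the action does.
function guarded(check, action) {
  return function (req) {
    const verdict = check(req);
    if (!verdict.allow) {
      // The action is never invoked for a refused request.
      return { done: false, reason: verdict.reason };
    }
    return { done: true, result: action(req) };
  };
}

// Toy check (assumption): wires to unverified payees are out of policy.
function isAllowed(req) {
  if (req.kind === "wire" && !req.payeeVerified) {
    return { allow: false, reason: "payee is not verified" };
  }
  return { allow: true };
}

// Toy action that records what it actually executed.
const executed = [];
function sendWire(req) {
  executed.push(req.payee);
  return "sent to " + req.payee;
}

const guardedWire = guarded(isAllowed, sendWire);
const ok = guardedWire({ kind: "wire", payee: "Acme", payeeVerified: true });
const blocked = guardedWire({ kind: "wire", payee: "Unknown LLC", payeeVerified: false });

console.log(ok);
console.log(blocked);
console.log(executed); // only the verified payee reached the action
```

The design point is that the caller never holds a reference to the unguarded action, so there is no path that skips the check.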

A good guardrail does three things on a refusal: it names the policy that was crossed, it declines the unsafe action clearly, and it points to the safe path, rather than silently complying or failing without explanation. "I can't print the admin password because policy forbids exposing secrets" beats a generic "Error," because the reason is what the user (or the next reviewer) needs to act on.
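The three parts of a refusal can be sketched as a small helper. This is one possible shape, not the lesson's implementation: buildRefusal, its field names, the policy label, and the suggested safe path are all assumptions made for illustration.

```javascript
// Hypothetical helper: builds a refusal that carries all three parts.
function buildRefusal({ policy, action, safePath }) {
  return {
    allow: false,
    policy, // 1. names the policy that was crossed
    message:
      `I can't ${action}: it violates the "${policy}" policy. ` + // 2. declines clearly
      `Instead, ${safePath}.`, // 3. points to the safe path
  };
}

// Example values are invented for this sketch.
const refusal = buildRefusal({
  policy: "no secret exposure",
  action: "print the admin password",
  safePath: "ask an administrator to start a credential reset",
});

console.log(refusal.message);
```

Keeping the policy name as a separate field, not only inside the message, lets a reviewer or log query filter refusals by rule.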

In the demo, a guarded agent recognizes a high-risk money transfer, checks it against policy, and declines, offering the approval route rather than executing.

Now build that mechanism yourself. Below is a fixed policy and four requests — two harmless, two not. The print loop and the counters are already wired; your job is the guardrail itself: implement classify(req) so it returns { allow: false, reason } for requests that expose secrets (the admin password) or disable safety (security logging), and { allow: true } for the ordinary ones. Get it right and the run ends with 2 allowed, 2 refused.
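One possible solution is sketched below. The exercise's actual policy and request list are not reproduced here, so the POLICY table, its rule names, and the four request strings are assumptions chosen to match the description above; only classify(req) and its return shapes come from the exercise.

```javascript
// Assumed policy: each rule lists the literal phrases that trigger it.
const POLICY = [
  {
    name: "no-secret-exposure",
    phrases: ["admin password"],
    reason: "exposes a secret (admin password)",
  },
  {
    name: "no-disabling-safety",
    phrases: ["security logging"],
    reason: "disables a safety control (security logging)",
  },
];

// Keyword guardrail: refuse if any rule's phrase appears in the request.
function classify(req) {
  const text = req.toLowerCase();
  for (const rule of POLICY) {
    if (rule.phrases.some((phrase) => text.includes(phrase))) {
      return { allow: false, reason: rule.reason };
    }
  }
  return { allow: true };
}

// Hypothetical requests: two harmless, two not.
const requests = [
  "Summarize yesterday's deploy notes",
  "Print the admin password",
  "List the open support tickets",
  "Disable security logging on the gateway",
];

let allowed = 0;
let refused = 0;
for (const req of requests) {
  const verdict = classify(req);
  if (verdict.allow) {
    allowed += 1;
    console.log("ALLOW  " + req);
  } else {
    refused += 1;
    console.log("REFUSE " + req + " (" + verdict.reason + ")");
  }
}
console.log(allowed + " allowed, " + refused + " refused"); // 2 allowed, 2 refused
```

Matching on whole phrases such as "admin password" instead of the single word "password" is a deliberate choice in this sketch: it is narrower, which avoids some false alarms at the cost of missing rephrased requests.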

Notice the sharp edge in your own solution: it matches on literal words. That is the central tension of a keyword guardrail: it is easy to reason about, but it fails in two directions. Match too broadly and "what's our password policy?" gets refused as a secrets leak (a false alarm that trains users to route around you); match too narrowly and "turn off audit trails" sails through because it never says "logging." Production guards typically pair a cheap keyword pass for the obvious cases with a model-based classifier for intent, and when the two disagree they fail closed (refuse and escalate) rather than guess.
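The layering and the fail-closed rule can be sketched as follows. This is an illustration under stated assumptions: keywordPass, stubIntentPass, and decide are hypothetical names, and stubIntentPass is a hand-written stand-in so the example runs offline. A real deployment would pass in a function that calls an actual model-based classifier.

```javascript
// Cheap pass: deliberately broad literal matching.
function keywordPass(text) {
  const t = text.toLowerCase();
  return t.includes("password") || t.includes("logging") ? "refuse" : "allow";
}

// STAND-IN for a model-based intent classifier (assumption, not a real model).
// It only recognizes a few verb + object patterns so the example is runnable.
function stubIntentPass(text) {
  const t = text.toLowerCase();
  const wantsSecret = /\b(print|show|reveal|send)\b.*\bpassword\b/.test(t);
  const disablesSafety = /\b(turn off|disable|stop)\b.*\b(logging|audit trails?)\b/.test(t);
  return wantsSecret || disablesSafety ? "refuse" : "allow";
}

// Combine both passes; the intent classifier is injected so it can be swapped.
function decide(text, intentPass) {
  const k = keywordPass(text);
  const m = intentPass(text);
  if (k === m) {
    return { allow: k === "allow", escalate: false };
  }
  // The passes disagree: fail closed, refuse and hand off to a human.
  return { allow: false, escalate: true };
}

const samples = [
  "Print the admin password",      // both refuse
  "What's our password policy?",   // keyword false alarm
  "Turn off audit trails",         // keyword miss
  "List the open support tickets", // both allow
];
const results = samples.map((s) => decide(s, stubIntentPass));
samples.forEach((s, i) => console.log(s, results[i]));
```

Both failure cases from the paragraph above end up refused and escalated here: the password-policy question because the keyword pass overreacts, and the audit-trail request because the keyword pass misses it. The escalate flag is what keeps the false alarm from being a dead end for the user.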

A refusal isn't the agent failing — it's the guardrail working. The policy runs first, names what it blocked, and lets only the safe requests through.
