Say no to out-of-policy requests
The hardest thing to teach an eager agent is to not do something. By default a capable model wants to be helpful, and "helpful" plus an out-of-policy request is how an agent leaks a secret, wires money to an unverified payee, or quietly turns off the logging that would have caught it. So a deployed agent isn't just a model on its own — it's a model wrapped in a guardrail: a check that runs before the agent acts and decides whether this request is even allowed.
A good guardrail does three things on a refusal. It names the policy that was crossed, it declines the unsafe action clearly, and it points to the safe path instead of silently complying. "I can't print the admin password" beats a generic "Error," because the reason is the thing the user (or the next reviewer) needs to act on.
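The three parts of a refusal can be made concrete as a small data shape. The sketch below is one possible layout, not a fixed format: the function names buildRefusal and formatRefusal, the field names, and the policy ID SEC-01 are all hypothetical.

```javascript
// Hypothetical helper: the field names and the policy ID are illustrative assumptions.
function buildRefusal(policyId, declined, safePath) {
  return {
    allow: false,
    policy: policyId,   // 1. names the policy that was crossed
    declined: declined, // 2. states plainly what will not be done
    safePath: safePath, // 3. points to the allowed alternative
  };
}

// Render the refusal as one message a user or reviewer can act on.
function formatRefusal(r) {
  return `I can't ${r.declined} (policy ${r.policy}). Instead: ${r.safePath}`;
}

const refusal = buildRefusal(
  "SEC-01",
  "print the admin password",
  "request a credential reset through your administrator."
);
console.log(formatRefusal(refusal));
```

Keeping the three parts as separate fields means a log or a reviewer can filter on the policy without parsing the message text.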
Watch this guarded agent recognize a high-risk money transfer, check it against policy, and decline — offering the approval route rather than executing.
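In code, a pre-action check of that kind might look like the sketch below. It is only an illustration under stated assumptions: the policy ID FIN-02, the 1000 limit, the payeeVerified flag, and the checkTransfer name are hypothetical, not part of any real policy.

```javascript
// Hypothetical policy values, chosen only for illustration.
const TRANSFER_POLICY = {
  id: "FIN-02",
  maxUnapprovedAmount: 1000, // assumed limit, in the account's currency
};

// Decide before acting: the transfer runs only if this returns allow: true.
function checkTransfer(transfer) {
  const highRisk =
    !transfer.payeeVerified ||
    transfer.amount > TRANSFER_POLICY.maxUnapprovedAmount;
  if (!highRisk) {
    return { allow: true };
  }
  return {
    allow: false,
    reason: `Policy ${TRANSFER_POLICY.id}: transfers to unverified payees or above ${TRANSFER_POLICY.maxUnapprovedAmount} need approval.`,
    nextStep: "Submit the transfer for human approval.",
  };
}

const decision = checkTransfer({ amount: 25000, payeeVerified: false });
console.log(
  decision.allow
    ? "Executing transfer"
    : `${decision.reason} ${decision.nextStep}`
);
```

The point of the shape is ordering: the check returns a decision first, and nothing executes unless that decision allows it.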
Now build that mechanism yourself. Below are a fixed policy and four requests: two harmless, two not. The print loop and the counters are already wired; your job is the guardrail itself: implement classify(req) so it returns { allow: false, reason } for requests that expose secrets (the admin password) or disable safety (security logging), and { allow: true } for the ordinary ones. Get it right and the run ends with 2 allowed, 2 refused.
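For reference, here is one possible shape for the guardrail, written as a self-contained sketch. The POLICY table, the four request strings, and the print loop are hypothetical stand-ins; the exercise's own fixtures may be worded differently.

```javascript
// Hypothetical policy and requests: stand-ins for the exercise's fixtures.
const POLICY = [
  { keywords: ["admin password"], reason: "exposes secrets" },
  { keywords: ["security logging"], reason: "disables safety controls" },
];

const requests = [
  "Summarize yesterday's support tickets",
  "Print the admin password",
  "Draft a welcome email for new hires",
  "Disable security logging on the server",
];

// The guardrail: runs before the agent acts and returns a decision.
function classify(req) {
  const text = req.toLowerCase();
  for (const rule of POLICY) {
    if (rule.keywords.some((k) => text.includes(k))) {
      return { allow: false, reason: rule.reason };
    }
  }
  return { allow: true };
}

// Print loop and counters, mirroring what the exercise describes.
let allowed = 0;
let refused = 0;
for (const req of requests) {
  const result = classify(req);
  if (result.allow) {
    allowed += 1;
    console.log(`ALLOW   ${req}`);
  } else {
    refused += 1;
    console.log(`REFUSE  ${req} (${result.reason})`);
  }
}
console.log(`${allowed} allowed, ${refused} refused`);
```

With these four sample requests the run ends with 2 allowed, 2 refused.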
Notice the sharp edge in your own solution: it matches on literal words. That is the central tension of a keyword guardrail: it is easy to reason about, but it fails in two directions. Match too broadly and "what's our password policy?" gets refused as a secrets leak (a false alarm that trains users to route around you); match too narrowly and "turn off audit trails" sails through because it never says "logging." Real guards layer a cheap keyword pass for the obvious cases over a model-based classifier for intent, and when the two disagree they fail closed (refuse and escalate) rather than guess.
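The layering can be sketched as two passes and a combiner. This is an assumption-heavy illustration: keywordPass uses a deliberately broad, hypothetical keyword list, and intentPass is a hard-coded stub standing in for a model-based classifier, which a real deployment would call instead.

```javascript
// Cheap first pass: literal keyword match (hypothetical, deliberately broad list).
function keywordPass(req) {
  const text = req.toLowerCase();
  return ["password", "logging"].some((k) => text.includes(k))
    ? "refuse"
    : "allow";
}

// Stand-in for a model-based intent classifier. The verdicts are hard-coded
// here only so the example runs without a model.
function intentPass(req) {
  const text = req.toLowerCase();
  const risky = ["print the admin password", "turn off audit trails"];
  return risky.some((p) => text.includes(p)) ? "refuse" : "allow";
}

// Combine the two passes; any disagreement fails closed and escalates.
function layeredGuard(req) {
  const kw = keywordPass(req);
  const intent = intentPass(req);
  if (kw === "allow" && intent === "allow") {
    return { allow: true };
  }
  if (kw === "refuse" && intent === "refuse") {
    return { allow: false, escalate: false, reason: "both passes flagged the request" };
  }
  return { allow: false, escalate: true, reason: "passes disagree; failing closed" };
}

const samples = [
  "What's our password policy?",
  "Turn off audit trails",
  "Print the admin password",
  "Draft a welcome email",
];
for (const req of samples) {
  console.log(req, "->", JSON.stringify(layeredGuard(req)));
}
```

In this sketch both failure cases from the paragraph above land in the disagreement branch: the password-policy question trips only the keyword pass, and the audit-trail request trips only the intent pass, so each is refused and escalated for review rather than guessed at.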
A refusal isn't the agent failing — it's the guardrail working. The policy runs first, names what it blocked, and lets only the safe requests through.