High stakes need approval
The guardrails so far validate, sanitize, and redact text. But your agent doesn't
just talk — it acts. It calls tools that move money, delete records, send wires.
And a model is wrong often enough that "trust it to never misfire on an irreversible
action" is not a plan. Picture the agent confidently deciding to delete-account
for the wrong user, or reading "refund the customer" and issuing $500 instead of
$5, or send-money $200 to an address it half-hallucinated. The text guardrails
wave all of these straight through, because the output looks perfectly clean — the
danger isn't in the words, it's in the consequence.
The fix isn't to make the agent timid; it's to put a human in the loop for the
actions that can't be taken back. You classify each proposed action by risk: the
reversible, low-stakes ones run on their own, and the irreversible or expensive ones
pause with NEEDS APPROVAL until a person signs off. The intuition is asymmetry
of regret — auto-approving a $5 refund that turns out wrong costs five dollars and
a moment; auto-approving a delete-account that turns out wrong can't be undone at
any price. So you spend a human's attention only where being wrong is expensive.
Notice the threshold, not just the category. delete-account and send-money
are always gated — there's no safe version. But refund is gated by amount:
over $100 needs a human, while the $5 refund sails through on ALLOW. That's the
difference between a blanket block (which makes the agent useless) and risk-tiering
(which keeps it fast where it's safe). Walking the queue, you'd see read-record -> ALLOW, refund $5 -> ALLOW, then refund $500, delete-account, and send-money $200 all routed to a human.
Finish gate(action): return "NEEDS APPROVAL" for every delete-account, every
send-money, and any refund over $100; return "ALLOW" for everything else,
including the $5 refund. Print one "<label> -> <decision>" line per action.
Automate the reversible; gate the irreversible. The point of a human in the loop is to be there exactly when being wrong is expensive — and nowhere else.