When tools fail
Every lesson so far assumed the tool works: you defined it, routed to it, folded its result back. But there is only a result to fold back if the call succeeded, and real tools fail often. The weather API times out, a rate limit bites, a service blips for 200 ms. A naive agent calls the tool once, hits the first transient error, and the whole task dies over a hiccup that would likely have cleared on the next try.
A resilient agent treats one failure as information, not a verdict. The mechanism combines two ideas. First, retry: wrap the call in a loop and try again, up to a hard cap, because most transient failures clear within an attempt or two. Second, a bounded fallback: when the cap is hit and every attempt has failed, don't crash; return a safe default and say so. The cap is what keeps "retry" from becoming an infinite hammer on a service that's genuinely down.
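The two ideas fit in one small function. What follows is a minimal sketch, not the lesson's reference solution: the name callWithRetry, its signature, the log wording, and the -1 default are all assumptions chosen for illustration, and the tool is kept synchronous to keep the loop easy to read.

```typescript
// Hypothetical helper: the name, signature, and log text are illustrative only.
function callWithRetry<T>(tool: () => T, maxAttempts: number, fallback: T): T {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return tool(); // success: stop retrying immediately
    } catch (err) {
      // One failure is information: log it and let the loop try again.
      const message = err instanceof Error ? err.message : String(err);
      console.log(`attempt ${attempt} failed: ${message}`);
    }
  }
  // Cap reached and every attempt failed: say so, then return the safe default.
  console.log(`all ${maxAttempts} attempts failed, using fallback`);
  return fallback;
}

// Usage with a tool that never recovers (an assumed stand-in for a dead service).
const alwaysDown = (): number => {
  throw new Error("service down");
};
const fallbackResult = callWithRetry(alwaysDown, 3, -1);
console.log(fallbackResult); // -1, after three logged failures
```

The for loop is the hard cap: the tool is called at most maxAttempts times, so a service that is truly down costs a bounded number of calls. A real agent would probably also wait between attempts, which this sketch leaves out.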
Trace the worked case with a cap of five attempts. The tool throws "flaky network"
on attempt 1, throws again on attempt 2, then returns 42 on attempt 3. A resilient
loop logs each failure, keeps going, and on attempt 3 reports
"ok -> 42 (succeeded on attempt 3)". Had all five allowed attempts failed instead,
it would print one fallback line rather than throw. Either way the loop always
terminates with a definite outcome: a result or a fallback, never an unhandled
exception. That's the difference between a demo and something you'd put in front
of users: bounded cost, and a known final state callers downstream can rely on.
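The trace above can be reproduced as a sketch that returns the outcome as data, so a caller always receives one of exactly two shapes. The names Outcome, makeFlakyTool, runWithRetry, and formatOutcome are hypothetical, and the wording of the fallback line is an assumption, since the lesson only says that one such line is printed.

```typescript
// Illustrative names and types; none of these come from the lesson itself.
type Outcome =
  | { kind: "ok"; value: number; attempt: number }
  | { kind: "fallback"; attempts: number };

// Builds a tool that throws "flaky network" on its first `failures` calls,
// then returns 42 on every call after that.
function makeFlakyTool(failures: number): () => number {
  let calls = 0;
  return () => {
    calls++;
    if (calls <= failures) throw new Error("flaky network");
    return 42;
  };
}

// Tries the tool up to maxAttempts times and never throws to the caller.
function runWithRetry(tool: () => number, maxAttempts: number): Outcome {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { kind: "ok", value: tool(), attempt };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      console.log(`attempt ${attempt} failed: ${message}`);
    }
  }
  return { kind: "fallback", attempts: maxAttempts };
}

// Turns either outcome into the single line a user would see.
function formatOutcome(outcome: Outcome): string {
  return outcome.kind === "ok"
    ? `ok -> ${outcome.value} (succeeded on attempt ${outcome.attempt})`
    : `fallback: all ${outcome.attempts} attempts failed, using safe default`;
}

// Worked case: fails twice, succeeds on attempt 3.
console.log(formatOutcome(runWithRetry(makeFlakyTool(2), 5)));
// prints: ok -> 42 (succeeded on attempt 3)

// Exhausted case: all five attempts fail, so one fallback line is printed.
console.log(formatOutcome(runWithRetry(makeFlakyTool(5), 5)));
```

Returning a tagged outcome instead of throwing is one way to give downstream callers the known final state: they branch on kind and never need their own try/catch around the tool.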
Below is a deliberately flaky tool: it fails on attempts 1 and 2 and returns 42 on
attempt 3. The starter calls it once and gives up. Rewrite it to loop up to
maxAttempts: log each failed attempt, and on success print the result and which
attempt won; if every attempt fails, print one safe fallback line. You're done
when the output ends with "succeeded on attempt 3".
Retries plus a bounded fallback turn a flaky tool into a dependable one — the agent keeps its promise to the loop even when the world misbehaves.