A correct answer that's slow or pricey can still lose
Your regression suite is green. Version B answers every case exactly as well as version A — same accuracy, same gold matches. So you ship B... and your token bill jumps 70% and the p95 response time crawls. B was "correct" the whole time. Your eval just never looked at what each correct answer cost.
Accuracy is one axis. Cost (tokens, dollars) and latency (milliseconds) are two more, and they're the ones that decide whether a working agent is actually deployable. A version that's right but twice as expensive, or right but twice as slow, can be the wrong choice — and an eval that only counts correct answers will wave it straight through. So you measure all three: score and budget.
The mechanism is just bookkeeping. Each eval case records what it consumed —
{ tokens, ms } — and per version you sum across the cases to get a total. Say
version A's three cases cost 120 + 90 + 150 = 360 tokens and 80 + 110 + 70 = 260
ms, while B's cost 80 + 70 + 60 = 210 tokens and 140 + 130 + 160 = 430 ms. Now
the tradeoff is legible: B is cheaper (210 < 360 tokens) but A is faster
(260 < 430 ms). That single comparison is the whole point — it turns "both pass" into
a decision you can defend.
Below, the two versions and their per-case costs are wired, and the print + verdict
lines are ready — but total() never actually adds anything up, so both versions
report 0 tokens, 0 ms and the verdict is a meaningless tie. Make total() sum
tokens and ms across the cases so it prints each version's real totals followed
by cheaper: and faster:.
An eval that only scores correctness silently picks the most expensive way to be right — measure cost and latency or you're flying blind on the bill.