Measure before you improve
You tweak the system prompt, the agent feels sharper, you ship. Next week a user asks something it used to nail and it flubs it. Did your change help or hurt? You have no idea — you were steering on a vibe. "It seems better" is the most expensive sentence in agent-building, because it lets a regression slip past you unmeasured.
The fix is a labeled dataset: a small list of cases, each pairing an input
with the gold answer you already know is right. That fixed set is your ruler. Run
any version of the agent against it and you get a number you can compare: version
A scores 7/10, version B scores 9/10, so ship B. The cases never move, so the only
thing that changes the score is the agent. That's what turns "seems better" into
"is better."
Concretely, take a tiny add task. Your cases might be { input: "2 + 2", gold: 4 },
{ input: "3 + 5", gold: 8 }, { input: "10 + 1", gold: 11 }. Three lines of data —
but now any future agent has a fixed target to hit, and you can spot the moment one
of these three breaks. A handful of real cases your users actually send beats a
hundred you invented; the dataset is the asset, and it only grows.
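
As a minimal sketch of the comparison described above, the three cases can be held as data and run against two versions of an agent. The names Case, Agent, score, failures, versionA, and versionB are illustrative assumptions, not part of the lesson's code, and the two "agents" are plain stand-in functions (version A carries a deliberate bug) rather than real model calls.

```typescript
// A labeled case: an input paired with the answer known to be correct.
type Case = { input: string; gold: number };

// Assumption: an "agent" is modeled as a plain function from input to answer.
type Agent = (input: string) => number;

// The fixed ruler: the three cases from the text. They never change between runs.
const cases: Case[] = [
  { input: "2 + 2", gold: 4 },
  { input: "3 + 5", gold: 8 },
  { input: "10 + 1", gold: 11 },
];

// Return the cases where the agent's answer differs from the gold answer.
function failures(agent: Agent, dataset: Case[]): Case[] {
  return dataset.filter((c) => agent(c.input) !== c.gold);
}

// Score = number of cases answered exactly right.
function score(agent: Agent, dataset: Case[]): number {
  return dataset.length - failures(agent, dataset).length;
}

// Hypothetical version A: keeps only the last digit of each operand (a deliberate bug).
const versionA: Agent = (input) =>
  input
    .split("+")
    .map((s) => Number(s.trim().slice(-1)))
    .reduce((sum, n) => sum + n, 0);

// Hypothetical version B: parses each operand in full.
const versionB: Agent = (input) =>
  input
    .split("+")
    .map((s) => Number(s.trim()))
    .reduce((sum, n) => sum + n, 0);

const scoreA = score(versionA, cases);
const scoreB = score(versionB, cases);
console.log(`A: ${scoreA}/${cases.length}, B: ${scoreB}/${cases.length}`);

// Name the exact cases version A gets wrong.
const brokenInA = failures(versionA, cases).map((c) => c.input);
console.log("A fails on:", brokenInA);
```

With these stand-ins, version A scores 2/3 and version B scores 3/3, and the failing input for A is "10 + 1". Because the cases are identical across both runs, the difference in score can only come from the agent.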
Below, the print loop is wired for you, but the cases array is empty, so it
reports 0 cases and there's nothing to score. Fill it with at least three
{ input, gold } cases for the add task so that it prints one line per case in the
form "case <i>: <input> => <gold>", followed by the total, "<N> cases". Done means a
real test set exists as data, ready for the next lesson to score.
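
For orientation, here is one way the filled-in exercise might look. This is a sketch under assumptions, not the lesson's actual starter code: the report helper is hypothetical, and numbering cases from 0 is a guess, since the text does not say where <i> starts.

```typescript
// A labeled case: an input paired with its known-correct answer.
type Case = { input: string; gold: number };

// At least three cases for the add task, taken from the examples in the text.
const cases: Case[] = [
  { input: "2 + 2", gold: 4 },
  { input: "3 + 5", gold: 8 },
  { input: "10 + 1", gold: 11 },
];

// Hypothetical helper: build one line per case, then the total.
// Assumption: the case index <i> starts at 0.
function report(dataset: Case[]): string[] {
  const out = dataset.map((c, i) => `case ${i}: ${c.input} => ${c.gold}`);
  out.push(`${dataset.length} cases`);
  return out;
}

const lines = report(cases);
lines.forEach((line) => console.log(line));
```

With an empty array the same helper yields only "0 cases", which matches the starting state described above.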
The dataset comes first. Everything else in evals — scoring, asserting, regression-checking — is something you run against these labeled cases.