Right or wrong, counted
Last lesson you built a test set of { input, gold } cases. A test set with no
scorer is just a list — to turn it into a number you need a metric, and the
simplest one is exact match: for each case, hold the model's prediction next to
the gold answer and ask, did it nail it? Count the hits, divide by the total, and
you have accuracy — the single number that lets you compare two agent versions.
The trap is comparing raw strings. A model that replies " Yes " is correct — but
" Yes " === "yes" is false, so naive comparison would mark a good answer wrong
and tank your score with noise. You normalize both sides first: trim() the
whitespace and lowercase, so " Yes " and "yes", "True" and "true" collapse to
the same thing. Now equality means meaningfully equal, not byte-identical.
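As a minimal sketch of that normalization step, assuming plain string predictions: the helper names normalize and exactMatch are illustrative, not part of the lesson's scaffold.

```javascript
// Normalize a string for comparison: strip surrounding whitespace, then lowercase.
// (Hypothetical helper name, chosen for this sketch.)
function normalize(text) {
  return text.trim().toLowerCase();
}

// Exact match on the normalized forms rather than on the raw strings.
function exactMatch(prediction, gold) {
  return normalize(prediction) === normalize(gold);
}

console.log(" Yes " === "yes");              // false: raw comparison
console.log(exactMatch(" Yes ", "yes"));     // true: normalized comparison
console.log(exactMatch("True", "true"));     // true
console.log(exactMatch("42", "forty-two"));  // false: different surface form
```

Normalizing both sides, not just the prediction, keeps the check symmetric, so a gold answer stored with stray capitals or spaces cannot cause a false miss.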
Walk the batch you're about to grade. "Paris" normalizes to paris and matches its
gold — correct. " Yes " → yes — correct. "True" → true — correct. But "42"
against gold "forty-two"? No amount of trimming makes those equal — wrong. Three of
four, so accuracy: 75%. That last case is the point: exact match judges surface
form only, so it marks an answer wrong even when the meaning is identical. That is
both its discipline and its limit.
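The walk above can be reproduced in a few lines. This is a sketch under assumptions: the gold values for the first three cases are guessed from the text (only "forty-two" is stated outright), the input field is omitted for brevity, and the names gradedBatch, normalize, and accuracy are illustrative.

```javascript
// Four graded cases from the walk. Gold values other than "forty-two" are assumed.
const gradedBatch = [
  { prediction: "Paris", gold: "paris" },
  { prediction: " Yes ", gold: "yes" },
  { prediction: "True",  gold: "true" },
  { prediction: "42",    gold: "forty-two" },
];

// Trim and lowercase so only meaningful differences remain.
function normalize(text) {
  return text.trim().toLowerCase();
}

// Accuracy = number of normalized exact matches / number of cases.
// Returns 0 for an empty batch to avoid dividing by zero.
function accuracy(batch) {
  if (batch.length === 0) return 0;
  const hits = batch.filter(
    (c) => normalize(c.prediction) === normalize(c.gold)
  ).length;
  return hits / batch.length;
}

const score = accuracy(gradedBatch);
console.log(`accuracy: ${(score * 100).toFixed(0)}%`); // accuracy: 75%
```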
Below is that four-case batch. The scaffold already prints each verdict and the final
accuracy — but it cheats, hard-coding correct = true so everything passes and you
get a meaningless 100%. Replace it with a real normalized comparison so the "42" case
comes out wrong and accuracy lands at 75%.
Exact match is blunt: it can't tell that "Paris, France" and "paris" name the same city. But it's cheap, deterministic, and the first metric you should reach for.