Exact-Match Scoring

Right or wrong, counted

Last lesson you built a test set of { input, gold } cases. A test set with no scorer is just a list — to turn it into a number you need a metric, and the simplest one is exact match: for each case, hold the model's prediction next to the gold answer and ask, did it nail it? Count the hits, divide by the total, and you have accuracy — the single number that lets you compare two agent versions.
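As a rough sketch of the "count the hits, divide by the total" step, assume the per-case verdicts have already been reduced to booleans. The verdicts array and the accuracy helper here are illustrative names, not part of the lesson's scaffold.

```javascript
// Hypothetical verdicts for four graded cases: true = hit, false = miss.
const verdicts = [true, true, true, false];

// Accuracy = hits / total. An empty set has no meaningful accuracy,
// so this sketch returns NaN rather than guessing 0 or 1.
function accuracy(results) {
  if (results.length === 0) return NaN;
  const hits = results.filter(Boolean).length;
  return hits / results.length;
}

console.log(accuracy(verdicts)); // 0.75
```

Returning a fraction and formatting it as a percentage only at print time keeps the number easy to compare across runs.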

The trap is comparing raw strings. A model that replies " Yes " is correct — but " Yes " === "yes" is false, so naive comparison would mark a good answer wrong and tank your score with noise. You normalize both sides first: trim() the whitespace and lowercase, so " Yes " and "yes", "True" and "true" collapse to the same thing. Now equality means meaningfully equal, not byte-identical.
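One way the normalization step might look, using only the built-in trim() and toLowerCase() string methods. The helper names normalize and exactMatch are assumptions made for illustration.

```javascript
// Normalize a string for comparison: strip surrounding whitespace, then lowercase.
function normalize(text) {
  return text.trim().toLowerCase();
}

// Exact match after normalizing both sides (hypothetical helper name).
function exactMatch(prediction, gold) {
  return normalize(prediction) === normalize(gold);
}

console.log(" Yes " === "yes");          // false: raw strings differ
console.log(exactMatch(" Yes ", "yes")); // true
console.log(exactMatch("True", "true")); // true
```

Normalizing the gold answer as well as the prediction means a stray capital or space in the test set cannot cause a false miss either.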

Walk the batch you're about to grade. "Paris" normalizes to paris and matches its gold: correct. " Yes " → yes: correct. "True" → true: correct. But "42" against gold "forty-two"? No amount of trimming or lowercasing makes those equal: wrong. Three of four, so accuracy is 75%. That last case is the point: exact match judges surface form only, so it marks an answer wrong even when the meaning is identical. That is both its discipline and its limit.

Below is that four-case batch. The scaffold already prints each verdict and the final accuracy, but it cheats: it hard-codes correct = true, so everything passes and you get a meaningless 100%. Replace that with a real normalized comparison so the "42" case comes out wrong and accuracy lands at 75%.

Exact match is blunt (it marks "Paris, France" wrong against a gold of "paris", even though a human grader would accept it), but it's cheap, deterministic, and the first metric you should reach for.
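To make that bluntness concrete, here is a small sketch that assumes the same trim-and-lowercase normalization described earlier. The isExact helper is a hypothetical name.

```javascript
// Hypothetical helper: exact match after trimming and lowercasing both sides.
const isExact = (prediction, gold) =>
  prediction.trim().toLowerCase() === gold.trim().toLowerCase();

// Same meaning to a human reader, different surface form: both count as misses.
console.log(isExact("Paris, France", "paris")); // false
console.log(isExact("42", "forty-two"));        // false

// Determinism: the same inputs always give the same verdict.
console.log(isExact("PARIS ", "paris"));        // true
```

Both misses are answers a person would likely accept. Exact match trades that kind of judgment for speed and repeatability.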

