Evals · Free preview

Exact-Match Scoring

Right or wrong, counted

The simplest eval metric is normalized exact match: trim and lowercase each prediction and gold answer, mark it correct only if they match, and report the fraction correct as accuracy.

Last lesson you built a test set of { input, gold } cases. A test set with no scorer is just a list — to turn it into a number you need a metric, and the simplest one is exact match: for each case, hold the model's prediction next to the gold answer and ask, did it nail it? Count the hits, divide by the total, and you have accuracy — the single number that lets you compare two agent versions.

The trap is comparing raw strings. A model that replies " Yes " is correct — but " Yes " === "yes" is false, so naive comparison would mark a good answer wrong and tank your score with noise. You normalize both sides first: trim() the whitespace and lowercase, so " Yes " and "yes", "True" and "true" collapse to the same thing. Now equality means meaningfully equal, not byte-identical.
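As a minimal sketch of that normalization step, the snippet below wraps trim() and toLowerCase() in one helper. The name normalize is an assumption made for illustration; the lesson does not prescribe a helper name.

```javascript
// Hypothetical helper (name is illustrative): strip surrounding
// whitespace, then lowercase, so surface noise doesn't affect equality.
function normalize(text) {
  return text.trim().toLowerCase();
}

// Raw comparison treats a good answer as wrong.
console.log(" Yes " === "yes"); // false

// Normalizing both sides first makes equality meaningful.
console.log(normalize(" Yes ") === normalize("yes")); // true
console.log(normalize("True") === normalize("true")); // true
```

Normalizing both the prediction and the gold answer, rather than only the prediction, means a stray capital or space in the test set can't cause a false miss either.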

Walk the batch you're about to grade. "Paris" normalizes to paris and matches its gold: correct. " Yes " becomes yes: correct. "True" becomes true: correct. But "42" against gold "forty-two"? No amount of trimming or lowercasing makes those equal: wrong. Three of four, so accuracy is 75%. That last case is the point: exact match is strict about surface form even when the meaning is identical, which is both its discipline and its limit.

Below is that four-case batch. The scaffold already prints each verdict and the final accuracy, but it cheats: it hard-codes correct = true, so everything passes and you get a meaningless 100%. Replace that with a real normalized comparison so case 2 (the "42" prediction; the scaffold numbers cases from 0) comes out wrong and accuracy lands at 75%.

Exact match is blunt: it marks "Paris, France" wrong against a gold of "paris" even though the meaning is the same. But it's cheap, deterministic, and the first metric you should reach for.
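To make that bluntness concrete, here is a small sketch of where normalization helps and where it stops helping. The helper names normalize and exactMatch are hypothetical, chosen only for this illustration.

```javascript
// Hypothetical helpers (names are illustrative, not from the lesson).
const normalize = (text) => text.trim().toLowerCase();
const exactMatch = (pred, gold) => normalize(pred) === normalize(gold);

// Whitespace and casing differences are absorbed by normalization.
console.log(exactMatch("PARIS ", "paris")); // true

// Extra or differently phrased content is not: same meaning, different surface form.
console.log(exactMatch("Paris, France", "paris")); // false
console.log(exactMatch("42", "forty-two")); // false
```

Both false results are arguably right answers, which is the trade-off: the metric stays cheap and deterministic because it never tries to judge meaning.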

In the full academy, you write and run this — live, graded:

// A batch of model predictions against the gold (correct) answers.
const cases = [
  { pred: "Paris", gold: "paris" },
  { pred: " Yes ", gold: "yes" },
  { pred: "42", gold: "forty-two" },
  { pred: "True", gold: "true" },
];

let hits = 0;
cases.forEach((c, i) => {
  // TODO: a prediction is correct only if, after trim + lowercase,
  // it EXACTLY equals the gold answer. Right now we wave everything through.
  const correct = true;

  if (correct) hits += 1;
  console.log(`${i}: ${correct ? "correct" : "wrong"}`);
});

const accuracy = Math.round((hits / cases.length) * 100);
console.log(`accuracy: ${accuracy}%`);
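One way the TODO might be filled in is sketched below. It assumes a small normalize helper and a verdicts array, neither of which is part of the original scaffold; treat it as one possible completion rather than the graded reference answer.

```javascript
// Same four-case batch as the scaffold.
const cases = [
  { pred: "Paris", gold: "paris" },
  { pred: " Yes ", gold: "yes" },
  { pred: "42", gold: "forty-two" },
  { pred: "True", gold: "true" },
];

// Hypothetical helper (not in the scaffold): trim, then lowercase.
const normalize = (text) => text.trim().toLowerCase();

let hits = 0;
const verdicts = []; // kept only so the results can be inspected afterwards
cases.forEach((c, i) => {
  // Correct only if both sides are identical after normalization.
  const correct = normalize(c.pred) === normalize(c.gold);

  if (correct) hits += 1;
  verdicts.push(correct);
  console.log(`${i}: ${correct ? "correct" : "wrong"}`);
});

// 3 hits out of 4 cases rounds to 75.
const accuracy = Math.round((hits / cases.length) * 100);
console.log(`accuracy: ${accuracy}%`);
```

With this comparison in place, cases 0, 1, and 3 print correct, case 2 prints wrong, and the final line reads accuracy: 75%.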
