Evals · Free preview

Catch a Regression

v2 must not break v1

A regression eval runs the same test set against the old and new versions and flags only the cases that PASSED before and FAIL now — a new failure you introduced, not a pre-existing one.

v2 must not break v1

You shipped v1; it passed its eval. Then you tweaked the prompt to fix one annoying case, and v2 does fix it — but it also quietly changed how it formats two other answers that used to be correct. You run v2's eval, see "8 of 10 pass," and feel fine. You shouldn't: the headline pass rate hides which cases changed. A model that scores the same 80% on a different 80% has silently traded working behavior for new behavior, and your users feel every swap.

A regression names exactly the bad half of that trade: a case that passed in v1 and now fails in v2 — behavior you broke, not behavior that was always broken. You can't see it from v2 alone, because a v2 failure might have been failing all along. So you run the same test set from lesson one against both versions, score each case with the pass/fail check you built earlier, and diff: keep only the cases that flipped from PASS to FAIL.

Walk the set. c1 ("ping"→"pong") passes in both — fine. But c2 returned "4" in v1 and "four" in v2: PASS then FAIL — a regression. c4 went "10""ten": another. Now the trap: c5 expects "ok" but both versions return "?" — FAIL in both. It's broken, but you didn't break it this time, so it is not a regression and must stay off the list. The diff is c2, c4, and nothing else.

Below, both versions and the shared test set are wired, but the loop only scores v2. Run v1 too and push t.id only when passV1 && !passV2. Done means the final line reads regressions: c2, c4 — the two breaks you introduced, with c5 correctly excluded.

Absolute pass rate tells you how good you are; the regression diff tells you whether your last change made things worse. Ship on the diff.

In the full academy, you write and run this — live, graded:

// One test set, two agent versions. A "regression" is a case that PASSED on the
// old agent (v1) but now FAILS on the new one (v2) — a break YOU just introduced.

// The agents: input -> output. (v2 changed how it rounds and tags some answers.)
function v1(input) {
  const map = { ping: "pong", "2+2": "4", hi: "hello", sum: "10", weird: "?" };
  return map[input];
}
function v2(input) {
  const map = { ping: "pong", "2+2": "four", hi: "hello", sum: "ten", weird: "?" };
  return map[input];
}

// The shared test set: each case has an id, an input, and the expected output.
const tests = [
  { id: "c1", input: "ping", expected: "pong" },
  { id: "c2", input: "2+2", expected: "4" },
  { id: "c3", input: "hi", expected: "hello" },
  { id: "c4", input: "sum", expected: "10" },
  { id: "c5", input: "weird", expected: "ok" },
];

// TODO: a regression needs BOTH versions. Right now we only sco

🔒 Live code execution, real agent runs, mastery tracking and verifiable credentials unlock with the full academy.

This is 1 of 50 lessons.

The full academy: write real code, watch real agents run, and earn verifiable credentials — across 8 tracks, in a 3D campus.

Unlock the full academy — $100 →

14-day refund · 🔒 Stripe-secured checkout · lifetime access

More free lessons: An LLM Is a Function  ·  The Agent Loop  ·  Define a Tool  ·  Give an Agent a Tool  ·  Durable State

← The Agent Marketplace