Catch a Regression

v2 must not break v1

You shipped v1; it passed its eval. Then you tweaked the prompt to fix one annoying case, and v2 does fix it — but it also quietly changed how it formats two other answers that used to be correct. You run v2's eval, see "8 of 10 pass," and feel fine. You shouldn't: the headline pass rate hides which cases changed. A model that scores the same 80% on a different 80% has silently traded working behavior for new behavior, and your users feel every swap.
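To make that concrete, here is a minimal sketch with made-up results for five hypothetical cases (the ids a through e and the objects v1Results and v2Results are illustrative, not part of this lesson's test set). Both versions score 80%, yet the set of passing cases differs.

```javascript
// Hypothetical per-case outcomes: true = PASS, false = FAIL.
const v1Results = { a: true, b: true, c: true, d: true, e: false };
const v2Results = { a: true, b: true, c: false, d: true, e: true };

// Fraction of cases that passed.
const passRate = (results) => {
  const outcomes = Object.values(results);
  return outcomes.filter(Boolean).length / outcomes.length;
};

// Ids of the cases that passed.
const passingIds = (results) =>
  Object.keys(results).filter((id) => results[id]);

console.log(passRate(v1Results), passRate(v2Results)); // same headline number for both
console.log(passingIds(v1Results).join(", ")); // a, b, c, d
console.log(passingIds(v2Results).join(", ")); // a, b, d, e
```

The headline number is identical, but case c went from passing to failing, and only a per-case comparison shows it.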

A regression is exactly the bad half of that trade: a case that passed in v1 and now fails in v2. It is behavior you broke, not behavior that was always broken. You can't see it from v2 alone, because a v2 failure might have been failing all along. So you run the same test set from lesson one against both versions, score each case with the pass/fail check you built earlier, and diff: keep only the cases that flipped from PASS to FAIL.
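The run-both, score, and diff steps might be sketched as follows. Everything named here is an assumption for illustration: findRegressions is a hypothetical helper, exactMatch stands in for the pass/fail check from the earlier lesson (which may be stricter or looser than plain string equality), and fakeV1 and fakeV2 are lookup tables standing in for real model calls.

```javascript
// Assumed check: exact string equality between output and expected answer.
const exactMatch = (output, expected) => output === expected;

// Hypothetical helper: score every case against both versions and
// collect the ids that passed in v1 but fail in v2.
function findRegressions(testSet, runV1, runV2, check) {
  const regressions = [];
  for (const t of testSet) {
    const passV1 = check(runV1(t.input), t.expected);
    const passV2 = check(runV2(t.input), t.expected);
    // Keep only PASS -> FAIL flips; FAIL -> FAIL was already broken.
    if (passV1 && !passV2) regressions.push(t.id);
  }
  return regressions;
}

// Made-up test set and canned "models", for demonstration only.
const demoSet = [
  { id: "a", input: "2+2", expected: "4" },
  { id: "b", input: "capital of France", expected: "Paris" },
  { id: "c", input: "3*3", expected: "9" },
];
const fakeV1 = (input) =>
  ({ "2+2": "4", "capital of France": "Paris", "3*3": "6" }[input]);
const fakeV2 = (input) =>
  ({ "2+2": "four", "capital of France": "Paris", "3*3": "6" }[input]);

// Only "a" flipped from PASS to FAIL; "c" fails in both versions.
console.log(findRegressions(demoSet, fakeV1, fakeV2, exactMatch));
```

Passing the check in as a parameter keeps the diff independent of how a case is scored, so the same loop works if the check changes later.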

Walk the set. c1 ("ping"→"pong") passes in both — fine. But c2 returned "4" in v1 and "four" in v2: PASS then FAIL — a regression. c4 went "10" → "ten": another. Now the trap: c5 expects "ok" but both versions return "?" — FAIL in both. It's broken, but you didn't break it this time, so it is not a regression and must stay off the list. The diff is c2, c4, and nothing else.
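The walk above can be replayed from recorded outputs. This sketch covers only the four cases named in the text. It assumes an exact-match check, so the expected values for c2 and c4 are inferred from the v1 outputs that passed. The names recorded and classify are hypothetical.

```javascript
// Recorded outputs for the cases discussed above (other cases omitted).
// Expected values for c2 and c4 are inferred, assuming exact-match scoring.
const recorded = [
  { id: "c1", expected: "pong", v1: "pong", v2: "pong" },
  { id: "c2", expected: "4", v1: "4", v2: "four" },
  { id: "c4", expected: "10", v1: "10", v2: "ten" },
  { id: "c5", expected: "ok", v1: "?", v2: "?" },
];

// Label each case by how its result moved from v1 to v2.
function classify(r) {
  const passV1 = r.v1 === r.expected;
  const passV2 = r.v2 === r.expected;
  if (passV1 && passV2) return "still passing";
  if (passV1 && !passV2) return "regression";
  if (!passV1 && passV2) return "fixed";
  return "still failing";
}

for (const r of recorded) {
  console.log(`${r.id}: ${classify(r)}`);
}

const regressionIds = recorded
  .filter((r) => classify(r) === "regression")
  .map((r) => r.id);
console.log(`regressions: ${regressionIds.join(", ")}`); // regressions: c2, c4
```

Labeling all four transitions makes the c5 trap explicit: "still failing" is its own bucket and never reaches the regression list.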

Below, both versions and the shared test set are wired, but the loop only scores v2. Run v1 too and push t.id only when passV1 && !passV2. Done means the final line reads regressions: c2, c4 — the two breaks you introduced, with c5 correctly excluded.

Absolute pass rate tells you how good you are; the regression diff tells you whether your last change made things worse. Ship on the diff.

This is 1 of 50 lessons.