Score against a rubric
Exact-match works when there's one right answer. But most agent output (a support reply, a summary, an explanation) is open-ended. Two correct answers can be worded completely differently, so a string compare tells you nothing. The move is LLM-as-judge: hand a second model a rubric and have it score each criterion, so "is this good?" becomes a structured, repeatable number you can track release over release.
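To make the gap concrete, here is a minimal Python sketch. The two replies and the criterion names are invented for illustration, and the per-criterion verdicts are hard-coded stand-ins for what a judge model would return; no real model is called.

```python
# Hypothetical reference answer and candidate reply: both correct, worded differently.
reference = "You can reset your password from Settings > Security."
candidate = "Head to the Security tab under Settings to set a new password."

# Exact match marks the correct candidate as wrong.
exact_match = candidate == reference
print(f"exact match: {exact_match}")

# Stand-in for a judge model's output: one boolean per rubric criterion.
# These values are hard-coded assumptions, not real model output.
verdicts = {
    "answers the question": True,
    "gives concrete steps": True,
    "polite tone": True,
}

# The rubric turns an open-ended reply into a number you can track.
met = sum(verdicts.values())
print(f"rubric: {met}/{len(verdicts)} criteria met")
```

The string compare rejects a reply that a reader would accept, while the per-criterion view still yields a score.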
The catch most people miss: the judge model is only half the system. The model is the brain: it reads an answer and decides whether each criterion holds. The harness around it is what turns those decisions into an eval. It applies the rubric (pass only if every criterion holds), names the criterion that broke when one does, and aggregates the per-case verdicts into a score. A vague rubric or a lenient harness produces a judge that rubber-stamps everything, and a score that always reads 100% measures nothing.
First, watch the judge read a support reply, weigh it against three rubric criteria one at a time, and return a verdict with a score.
Now build the harness yourself — not the model's brain, but the logic that wraps it.
You're given three candidate answers with their rubric facts already extracted (hasCitation, onTopic), exactly what a judge model would hand back. Write judge(c) so it returns PASS only when every criterion holds, and otherwise returns the first failed criterion as the reason. The starter rubber-stamps everything (3/3 passed); make it actually grade, so it prints PASS, FAIL — missing citation, FAIL — off topic, and the aggregated score: 1/3 passed.
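One possible shape for the harness, sketched in Python since the lesson's starter code is not shown here. The candidate values, the RUBRIC table, and the order in which criteria are checked (citation before topic) are assumptions chosen to reproduce the expected output; only judge, hasCitation, and onTopic come from the exercise.

```python
# Candidate answers with rubric facts already extracted.
# The concrete values are assumptions, not the lesson's actual data.
candidates = [
    {"hasCitation": True, "onTopic": True},
    {"hasCitation": False, "onTopic": True},
    {"hasCitation": True, "onTopic": False},
]

# Rubric as ordered (field, failure reason) pairs.
# The order decides which failure is reported first (assumed: citation, then topic).
RUBRIC = [
    ("hasCitation", "missing citation"),
    ("onTopic", "off topic"),
]


def judge(c):
    """Return "PASS" if every criterion holds, else name the first failed one."""
    for field, reason in RUBRIC:
        if not c[field]:
            return f"FAIL — {reason}"
    return "PASS"


# Grade every case, then aggregate the per-case verdicts into one score.
verdicts = [judge(c) for c in candidates]
for v in verdicts:
    print(v)

passed = sum(v == "PASS" for v in verdicts)
score = f"{passed}/{len(verdicts)} passed"
print(score)
```

Keeping the rubric as data rather than as branches inside judge means a new criterion is one more entry in the table, and the reporting order stays explicit.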
A judge is only as good as its rubric — and the rubric lives in the harness, not the model. Spell out every criterion, fail on the first one that breaks, and a fuzzy judgment becomes an eval you can run on every release.