Evals

LLM as Judge

Score against a rubric

When outputs are open-ended, you can't exact-match them — so you give a second model a rubric and have it grade each criterion, turning fuzzy quality into a structured score. The model is the brain; you build the harness that applies the rubric and aggregates the verdicts.


Exact-match works when there's one right answer. But most agent output (a support reply, a summary, an explanation) is open-ended. Two correct answers can share almost no wording, so a string compare tells you nothing about quality. The move is LLM-as-judge: hand a second model a rubric and have it score each criterion, so "is this good?" becomes a structured, repeatable number you can track release over release.
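To make the problem concrete, here is a minimal sketch with two hypothetical support replies that both answer the same question correctly. The replies, the question, and the criterion names are invented for illustration and are not from the lesson; the per-criterion verdicts are filled in by hand to show the shape of what a judge returns.

```python
# Two hypothetical replies to "How do I reset my password?" Both are correct.
reply_a = "Go to Settings, choose Security, and click Reset password."
reply_b = "You can reset it under Security in your account settings."

# Exact-match treats any difference in wording as a failure.
exact_match = reply_a == reply_b

# What a judge hands back instead: one boolean per rubric criterion.
# These criterion names and values are illustrative, written by hand here.
verdict_a = {"answers_question": True, "names_correct_menu": True}
verdict_b = {"answers_question": True, "names_correct_menu": True}

print(exact_match)               # False
print(all(verdict_a.values()))   # True
print(all(verdict_b.values()))   # True
```

The string compare rejects a correct answer, while the rubric view rates both replies the same because it checks what each reply does rather than how it is worded.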

The catch most people miss: the judge model is only half the system. The model is the brain: it reads an answer and decides whether each criterion holds. The harness around it is what turns those decisions into an eval. It applies the rubric (pass only if every criterion holds), names the criterion that broke when one does, and aggregates the per-case verdicts into a score. A vague rubric or a lenient harness produces a judge that rubber-stamps everything, and a score that always reads 100% measures nothing.
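The difference between a lenient and a strict harness can be sketched in a few lines. This assumes the judge model has already returned one boolean per criterion; the criterion names (`accurate`, `polite`, `cites_source`), the four cases, and the helper names are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Verdict = Dict[str, bool]  # criterion name -> whether it holds

# Hypothetical per-criterion verdicts for four answers.
verdicts: List[Verdict] = [
    {"accurate": True,  "polite": True, "cites_source": True},
    {"accurate": False, "polite": True, "cites_source": True},
    {"accurate": True,  "polite": True, "cites_source": False},
    {"accurate": False, "polite": True, "cites_source": False},
]

def lenient(v: Verdict) -> bool:
    """Passes if any single criterion holds: the rubber stamp."""
    return any(v.values())

def strict(v: Verdict) -> bool:
    """Passes only if every criterion holds."""
    return all(v.values())

def first_failure(v: Verdict) -> Optional[str]:
    """Name the first criterion that broke, in rubric order."""
    for name, holds in v.items():
        if not holds:
            return name
    return None

def pass_rate(rule: Callable[[Verdict], bool], cases: List[Verdict]) -> float:
    """Aggregate per-case verdicts into a single score."""
    return sum(1 for v in cases if rule(v)) / len(cases)

print(pass_rate(lenient, verdicts))   # 1.0
print(pass_rate(strict, verdicts))    # 0.25
print(first_failure(verdicts[3]))     # accurate
```

Same model output, two harnesses: the lenient rule reports a perfect score because every reply happens to be polite, while the strict rule exposes that only one of four answers meets the whole rubric.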

First, watch the judge read a support reply, weigh it against three rubric criteria one at a time, and return a verdict with a score.

Now build the harness yourself — not the model's brain, but the logic that wraps it. You're given three candidate answers with their rubric facts already extracted (hasCitation, onTopic) — exactly what a judge model would hand back. Write judge(c) so it returns PASS only when every criterion holds, and otherwise returns the first failed criterion as the reason. The starter rubber-stamps everything (3/3 passed); make it actually grade, so it prints PASS, FAIL — missing citation, FAIL — off topic, and the aggregated score: 1/3 passed.
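One possible shape for that harness, written here in Python as a sketch rather than as the lesson's reference solution. The candidate data is hypothetical (the lesson's actual candidates are not shown), the order in which criteria are checked is an assumption, and the printed format is simplified compared with the lesson's expected lines.

```python
from typing import NamedTuple, Optional

class Verdict(NamedTuple):
    passed: bool
    reason: Optional[str]  # first failed criterion, or None on a pass

# The rubric as ordered (fact name, failure reason) pairs.
# Assumption: citation is checked before topic; the lesson does not say.
RUBRIC = [
    ("hasCitation", "missing citation"),
    ("onTopic", "off topic"),
]

def judge(c: dict) -> Verdict:
    """Pass only when every criterion holds; otherwise name the first failure."""
    for fact, reason in RUBRIC:
        # A missing fact counts as a failure rather than a silent pass.
        if not c.get(fact, False):
            return Verdict(False, reason)
    return Verdict(True, None)

# Hypothetical candidates with their rubric facts already extracted.
candidates = [
    {"hasCitation": True,  "onTopic": True},
    {"hasCitation": False, "onTopic": True},
    {"hasCitation": True,  "onTopic": False},
]

verdicts = [judge(c) for c in candidates]
passed = sum(1 for v in verdicts if v.passed)
score = f"{passed}/{len(candidates)} passed"

for v in verdicts:
    print("PASS" if v.passed else f"FAIL ({v.reason})")
print(score)
# PASS
# FAIL (missing citation)
# FAIL (off topic)
# 1/3 passed
```

Keeping the rubric as data rather than as a chain of `if` statements means adding a criterion is a one-line change, and the check order that decides which failure gets reported is visible in one place.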

A judge is only as good as its rubric — and the rubric lives in the harness, not the model. Spell out every criterion, fail on the first one that breaks, and a fuzzy judgment becomes an eval you can run on every release.


This is 1 of 50 lessons.
