Evals · Free preview

Assertion Checks

Test behavior, not just strings

An assertion eval checks a property of the output — it contains a phrase, matches a format, or falls in a numeric range — instead of demanding one exact string.

Test behavior, not just strings

Exact match scored you 75% last lesson — but it can't grade free-form replies at all. A billing agent that says "Your refund of $129.50 will arrive in 5 business days" is a great answer, but it will never equal any single gold string you write down: the wording and order drift run to run. Hold a free-form reply to byte-equality and every good answer fails. You don't care about the exact string here — you care about a property of it.

An assertion eval checks that property directly. Instead of output === gold, you ask narrower yes/no questions: does the reply contain "refund"? Does it match a $00.00 price format? Is the extracted total in range? Each is a boolean, and the reply passes only if it satisfies the ones that matter. This is exactly how this academy grades your code — not by exact text, but by checking your output's shape.

Take that refund reply. contains(reply, "refund") → the substring is there, PASS. matchesFormat(reply, /\$\d+\.\d{2}/)$129.50 fits the pattern, PASS. But inRange(129.5, 0, 100) → 129.5 is over budget, FAIL — and a real assertion has to say so, not wave it through. That honest FAIL is the whole value: assertions flag "refund went out, but for too much money," a defect exact match can't even see.

Below, that agent answered a billing question and three assertion functions are stubbed to always return true — rubber-stamping every check. Fill them in so they test the reply for real: contains runs .includes, matchesFormat runs regex.test, inRange checks lo <= n <= hi. Done means the two passes stay PASS and the range check turns to FAIL.

A good eval is a question with a yes/no answer about the output — substring, format, range. String-equality is just the weakest assertion of them all.

In the full academy, you write and run this — live, graded:

// An agent answered a billing question. We grade its reply with assertions —
// richer than exact-match: a phrase it must contain, a format it must match,
// and a number that must land in range.
const reply = "Your refund of $129.50 will arrive in 5 business days.";
const total = 129.5; // dollar amount the agent extracted from the reply

// Each assertion takes the reply (and value) and returns true/false.
function contains(text, needle) {
  // TODO: does the reply contain this substring?
  return true;
}

function matchesFormat(text, re) {
  // TODO: does the reply match this format (regex)?
  return true;
}

function inRange(n, lo, hi) {
  // TODO: is n within [lo, hi] inclusive?
  return true;
}

const report = (label, ok) => console.log(`${label} -> ${ok ? "PASS" : "FAIL"}`);

report('contains "refund"', contains(reply, "refund"));
report("matches price format", matchesFormat(repl

🔒 Live code execution, real agent runs, mastery tracking and verifiable credentials unlock with the full academy.

This is 1 of 50 lessons.

The full academy: write real code, watch real agents run, and earn verifiable credentials — across 8 tracks, in a 3D campus.

Unlock the full academy — $100 →

14-day refund · 🔒 Stripe-secured checkout · lifetime access

More free lessons: An LLM Is a Function  ·  The Agent Loop  ·  Define a Tool  ·  Give an Agent a Tool  ·  Durable State

← The Agent Marketplace