Evals · Free preview

Why Evaluate

Measure before you improve

Evaluation starts with a labeled dataset — a small list of {input, gold} cases — so you have a fixed target to score every version of your agent against.

Measure before you improve

You tweak the system prompt, the agent feels sharper, you ship. Next week a user asks something it used to nail and it flubs it. Did your change help or hurt? You have no idea — you were steering on a vibe. "It seems better" is the most expensive sentence in agent-building, because it lets a regression slip past you unmeasured.

The fix is a labeled dataset: a small list of cases, each pairing an input with the gold answer you already know is right. That fixed set is your ruler. Run any version of the agent against it and you get a number you can compare — version A scored 7/10, version B scores 9/10, ship B. The cases never move, so the only thing that changes the score is the agent. That's what turns "seems better" into "is better."

Concretely, take a tiny add task. Your cases might be { input: "2 + 2", gold: 4 }, { input: "3 + 5", gold: 8 }, { input: "10 + 1", gold: 11 }. Three lines of data — but now any future agent has a fixed target to hit, and you can spot the moment one of these three breaks. A handful of real cases your users actually send beats a hundred you invented; the dataset is the asset, and it only grows.

Below, the print loop is wired for you, but the cases array is empty — so it reports 0 cases and there's nothing to score. Fill it with at least three { input, gold } cases for the add task so it prints each case <i>: <input> => <gold> line followed by the total <N> cases. Done means a real test set exists as data, ready for the next lesson to score.

The dataset comes first. Everything else in evals — scoring, asserting, regression-checking — is something you run against these labeled cases.

In the full academy, you write and run this — live, graded:

// Evaluation begins with a labeled dataset: a list of {input, gold} cases.
// Here the "task" is tiny: add the two numbers in 'input'; 'gold' is the right answer.

// TODO: fill this with at least 3 cases, e.g. { input: "2 + 2", gold: 4 }.
const cases = [];

// Print each case, then the total count. (Already wired — you just need the data.)
cases.forEach((c, i) => {
  console.log(`case ${i}: ${c.input} => ${c.gold}`);
});
console.log(`${cases.length} cases`);

🔒 Live code execution, real agent runs, mastery tracking and verifiable credentials unlock with the full academy.

This is 1 of 50 lessons.

The full academy: write real code, watch real agents run, and earn verifiable credentials — across 8 tracks, in a 3D campus.

Unlock the full academy — $100 →

14-day refund · 🔒 Stripe-secured checkout · lifetime access

More free lessons: An LLM Is a Function  ·  The Agent Loop  ·  Define a Tool  ·  Give an Agent a Tool  ·  Durable State

← The Agent Marketplace