Evals · Free preview

Cost & Latency

A correct answer that's slow or pricey can still lose

Accuracy is one axis; cost and latency are two more — an eval that only measures correctness will happily ship the version that doubles your token bill or your p95 response time.

A correct answer that's slow or pricey can still lose

Your regression suite is green. Version B answers every case exactly as well as version A — same accuracy, same gold matches. So you ship B... and your token bill jumps 70% and the p95 response time crawls. B was "correct" the whole time. Your eval just never looked at what each correct answer cost.

Accuracy is one axis. Cost (tokens, dollars) and latency (milliseconds) are two more, and they're the ones that decide whether a working agent is actually deployable. A version that's right but twice as expensive, or right but twice as slow, can be the wrong choice — and an eval that only counts correct answers will wave it straight through. So you measure all three: score and budget.

The mechanism is just bookkeeping. Each eval case records what it consumed — { tokens, ms } — and per version you sum across the cases to get a total. Say version A's three cases cost 120 + 90 + 150 = 360 tokens and 80 + 110 + 70 = 260 ms, while B's cost 80 + 70 + 60 = 210 tokens and 140 + 130 + 160 = 430 ms. Now the tradeoff is legible: B is cheaper (210 < 360 tokens) but A is faster (260 < 430 ms). That single comparison is the whole point — it turns "both pass" into a decision you can defend.

Below, the two versions and their per-case costs are wired, and the print + verdict lines are ready — but total() never actually adds anything up, so both versions report 0 tokens, 0 ms and the verdict is a meaningless tie. Make total() sum tokens and ms across the cases so it prints each version's real totals followed by cheaper: and faster:.

An eval that only scores correctness silently picks the most expensive way to be right — measure cost and latency or you're flying blind on the bill.

In the full academy, you write and run this — live, graded:

// Two agent versions ran the SAME eval cases. Each case records what it cost:
// { tokens, ms }. Accuracy was a tie — now budget decides which version ships.
const A = [
  { tokens: 120, ms: 80 },
  { tokens: 90, ms: 110 },
  { tokens: 150, ms: 70 },
];
const B = [
  { tokens: 80, ms: 140 },
  { tokens: 70, ms: 130 },
  { tokens: 60, ms: 160 },
];

function total(cases) {
  // TODO: sum the tokens and ms across every case in 'cases'.
  // Right now it pretends every version is free and instant.
  let tokens = 0;
  let ms = 0;
  // (add c.tokens and c.ms for each case here)
  return { tokens, ms };
}

const ta = total(A);
const tb = total(B);
console.log(`A: ${ta.tokens} tokens, ${ta.ms} ms`);
console.log(`B: ${tb.tokens} tokens, ${tb.ms} ms`);
console.log(`cheaper: ${ta.tokens <= tb.tokens ? "A" : "B"}`);
console.log(`faster: ${ta.ms <= tb.ms ? "A" : "B"}`);

🔒 Live code execution, real agent runs, mastery tracking and verifiable credentials unlock with the full academy.

This is 1 of 50 lessons.

The full academy: write real code, watch real agents run, and earn verifiable credentials — across 8 tracks, in a 3D campus.

Unlock the full academy — $100 →

14-day refund · 🔒 Stripe-secured checkout · lifetime access

More free lessons: An LLM Is a Function  ·  The Agent Loop  ·  Define a Tool  ·  Give an Agent a Tool  ·  Durable State

← The Agent Marketplace