Keyword Retrieval

Score by overlap

Lexical retrieval ranks chunks by how many distinct query words each one contains, then returns the top-k highest-scoring chunks.

You can chunk a document now, but a pile of chunks isn't an answer: given a question, you need to rank them and surface the few that matter. The simplest retriever that does this understands no meaning at all; it just counts word overlap. For each chunk you ask one question: how many of the query's distinct words appear in here? That count is the chunk's score. Rank by score, keep the highest few, and you have a working keyword retriever in about a dozen lines. Classic search engines start from the same term-matching idea and layer term weighting on top.

Two details keep it honest. First, use distinct query words: count each query word at most once, so a chunk that repeats "dog" ten times doesn't beat one that actually covers the topic. Second, stay literal: the score matches surface strings, not meaning, which is exactly why it's so explainable.
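To see why the distinct rule matters, here is a minimal sketch that scores the same chunks two ways. The helper names (`tokenize`, `rawCount`, `distinctCount`) and the whitespace tokenizer are assumptions made for illustration, not part of the lesson's starter code.

```javascript
// Hypothetical helper: lowercase, split on whitespace, drop empty strings.
const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);

// Raw count: every occurrence of a query word in the chunk adds a point.
function rawCount(chunk, query) {
  const wanted = new Set(tokenize(query));
  return tokenize(chunk).filter((w) => wanted.has(w)).length;
}

// Distinct count: each query word contributes at most one point.
function distinctCount(chunk, query) {
  const present = new Set(tokenize(chunk));
  const wanted = new Set(tokenize(query));
  let n = 0;
  for (const w of wanted) if (present.has(w)) n += 1;
  return n;
}

const spam = "dog dog dog dog dog";
const onTopic = "the dog ran fast and the cat watched";
const q = "fast dog cat";

console.log(rawCount(spam, q), rawCount(onTopic, q));           // 5 3
console.log(distinctCount(spam, q), distinctCount(onTopic, q)); // 1 3
```

With raw counting the repetitive chunk wins 5 to 3; with distinct counting the on-topic chunk wins 3 to 1.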

Work the example. The query "fast dog cat" has three distinct words. The chunk "The dog ran fast and the cat watched" contains all three → score 3. "A fast dog is a good dog" has "fast" and "dog" but not "cat" → score 2 (the second "dog" doesn't add anything; distinct words only). "Birds fly high in the sky" shares nothing → score 0, so it never reaches your top two. Sort descending, slice the top two, and you've returned the two chunks a human would have picked too.
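The worked example above can be checked in a few lines. This is one possible sketch, assuming the same whitespace tokenization the starter code uses; the `ranked` and `topTwo` names are illustrative.

```javascript
const query = "fast dog cat";
const chunks = [
  "The dog ran fast and the cat watched",
  "A fast dog is a good dog",
  "Birds fly high in the sky",
];

// The distinct words we're searching for.
const queryWords = [...new Set(query.toLowerCase().split(/\s+/))];

// Number of distinct query words present in the chunk.
function score(chunk) {
  const words = new Set(chunk.toLowerCase().split(/\s+/));
  return queryWords.filter((w) => words.has(w)).length;
}

// Pair each chunk with its score, then sort highest first.
const ranked = chunks
  .map((chunk) => ({ chunk, score: score(chunk) }))
  .sort((a, b) => b.score - a.score);

const topTwo = ranked.slice(0, 2);

for (const r of ranked) console.log(`score ${r.score}: ${r.chunk}`);
// score 3: The dog ran fast and the cat watched
// score 2: A fast dog is a good dog
// score 0: Birds fly high in the sky
```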

Why it matters in a real agent: lexical retrieval is cheap, instant, needs no model call, and every result is auditable — you can point at the exact words that matched and explain the ranking. Its blind spot is synonyms: ask about a "canine" and the dog chunks score zero, because the strings don't overlap. That gap is what the next lesson's embeddings exist to close.
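Both the auditability and the synonym blind spot are easy to show. The sketch below uses a hypothetical `explain` helper (not part of the lesson's starter code) that returns the matched words alongside the score, so you can see exactly why a chunk ranked where it did.

```javascript
// Hypothetical helper: returns the score plus the query words that matched.
function explain(chunk, query) {
  const words = new Set(chunk.toLowerCase().split(/\s+/));
  const queryWords = [...new Set(query.toLowerCase().split(/\s+/))];
  const matched = queryWords.filter((w) => words.has(w));
  return { score: matched.length, matched };
}

const chunk = "A fast dog is a good dog";

// Auditable: the matched words are the explanation for the score.
const hit = explain(chunk, "fast dog cat");
console.log(hit.score, hit.matched.join(", ")); // 2 fast, dog

// Blind spot: "canine" means dog, but the strings don't overlap.
const miss = explain(chunk, "canine");
console.log(miss.score, miss.matched.length);   // 0 0
```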

Below you have four chunks and the query "fast dog cat". Finish score() to count how many distinct query words appear in a chunk, rank all four (highest first, ties keep document order), and print the top two as score N: <chunk>. You're done when the score-3 and score-2 chunks print, in that order, and the other two chunks (score 1 and score 0) are gone.

Lexical retrieval is blunt but fast and transparent: every result is one you can explain by pointing at the words that matched.

Starter code for the exercise:

// Lexical retrieval: rank chunks by how many query words each one contains.
const chunks = [
  "The cat sat on the mat",
  "The dog ran fast and the cat watched",
  "Birds fly high in the sky",
  "A fast dog is a good dog",
];
const query = "fast dog cat";

// The distinct words we're searching for.
const queryWords = [...new Set(query.toLowerCase().split(/\s+/))];

// score(chunk) -> number of DISTINCT query words that appear in the chunk.
function score(chunk) {
  const words = new Set(chunk.toLowerCase().split(/\s+/));
  // TODO: count how many of queryWords are present in this chunk.
  return 0;
}

// TODO: rank chunks by score (highest first, ties keep original order)
// and keep only the top 2. Right now we just take the first 2 unscored.
const top = chunks.slice(0, 2);

for (const c of top) {
  console.log(`score ${score(c)}: ${c}`);
}
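The exercise asks that ties keep document order. One way to make that explicit, rather than relying on the sort being stable, is to carry each chunk's original index and use it as a tie-breaker. This is a sketch, not the graded solution; `topK` and the query "cat sky" are hypothetical, chosen only to force a tie.

```javascript
// Hypothetical helper: score every chunk, rank, and keep the best k.
function topK(chunks, query, k) {
  const queryWords = [...new Set(query.toLowerCase().split(/\s+/).filter(Boolean))];
  return chunks
    .map((chunk, index) => {
      const words = new Set(chunk.toLowerCase().split(/\s+/));
      const score = queryWords.filter((w) => words.has(w)).length;
      return { chunk, score, index };
    })
    // Higher score first; on a tie, the earlier document position wins.
    .sort((a, b) => b.score - a.score || a.index - b.index)
    .slice(0, k);
}

const chunks = [
  "The cat sat on the mat",
  "The dog ran fast and the cat watched",
  "Birds fly high in the sky",
  "A fast dog is a good dog",
];

// "cat sky" gives the first three chunks a score of 1 each.
const tied = topK(chunks, "cat sky", 2);
for (const r of tied) console.log(`score ${r.score}: ${r.chunk}`);
// score 1: The cat sat on the mat
// score 1: The dog ran fast and the cat watched
```

Calling `topK(chunks, "fast dog cat", 2)` on the same data returns the score-3 chunk followed by the score-2 chunk.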

