Keyword Retrieval

Score by overlap

You can chunk a document now, but a pile of chunks isn't an answer — given a question you need to rank them and surface the few that matter. The simplest retriever that does this understands no meaning at all; it just counts word overlap. For each chunk you ask one question: how many of the query's distinct words appear in here? That count is the chunk's score. Rank by score, keep the highest few, and you have a working keyword retriever — the same idea behind classic search engines, built in a dozen lines.

Two details keep it honest. First, count distinct query words: each query word counts at most once, so a chunk that repeats "dog" ten times doesn't beat one that actually covers the topic. Second, stay literal: this matches surface strings, not meaning, which is exactly why it's so explainable.
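A minimal sketch of the distinct-word rule, assuming lowercase matching and whitespace tokenization (both are choices made here for illustration; the lesson's starter code may tokenize differently):

```python
def score(query: str, chunk: str) -> int:
    """Count how many distinct query words appear in the chunk."""
    # Assumption: lowercase and split on whitespace; punctuation is not stripped.
    query_words = set(query.lower().split())  # set() keeps each query word once
    chunk_words = set(chunk.lower().split())  # repeats inside the chunk collapse too
    return len(query_words & chunk_words)     # size of the overlap


# Repeating one word does not inflate the score.
print(score("fast dog cat", "dog dog dog dog dog"))        # 1
print(score("fast dog cat", "a fast dog chased the cat"))  # 3
```

Because both sides are sets, the score can never exceed the number of distinct query words, no matter how long or repetitive the chunk is.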

Work the example. The query "fast dog cat" has three distinct words. The chunk "The dog ran fast and the cat watched" contains all three, so it scores 3. "A fast dog is a good dog" has "fast" and "dog" but not "cat", so it scores 2 (the second "dog" adds nothing, since only distinct words count). "Birds fly high in the sky" shares nothing and scores 0, so it never reaches your top two. Sort descending, slice the top two, and you've returned the two chunks a human would have picked too.
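The worked example can be reproduced with a short sketch. The helper names tokenize() and top_k() are hypothetical, and the regex tokenizer is an assumption made here so that punctuation would not block a match. Only the three chunks quoted above are used.

```python
import re


def tokenize(text: str) -> set[str]:
    # Assumption: words are runs of letters or digits, compared in lowercase.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query: str, chunk: str) -> int:
    # Number of distinct query words that also occur in the chunk.
    return len(tokenize(query) & tokenize(chunk))


def top_k(query: str, chunks: list[str], k: int = 2) -> list[tuple[int, str]]:
    # sorted() is stable, even with reverse=True, so ties keep document order.
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return [(score(query, c), c) for c in ranked[:k]]


chunks = [
    "Birds fly high in the sky",
    "A fast dog is a good dog",
    "The dog ran fast and the cat watched",
]

for s, chunk in top_k("fast dog cat", chunks):
    print(f"score {s}: {chunk}")
# score 3: The dog ran fast and the cat watched
# score 2: A fast dog is a good dog
```

The chunks are deliberately listed out of rank order here, so the output shows the sort doing the work rather than the input order.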

Why it matters in a real agent: lexical retrieval is cheap, instant, needs no model call, and every result is auditable — you can point at the exact words that matched and explain the ranking. Its blind spot is synonyms: ask about a "canine" and the dog chunks score zero, because the strings don't overlap. That gap is what the next lesson's embeddings exist to close.
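Both properties, the audit trail and the synonym gap, show up in a few lines. This is a sketch under the same assumptions as before (lowercase, whitespace-split words); matched_words() is a hypothetical helper, not part of the exercise:

```python
def matched_words(query: str, chunk: str) -> list[str]:
    """Return the distinct query words found in the chunk, for auditing a score."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return sorted(query_words & chunk_words)  # sorted for a stable, readable report


chunk = "A fast dog is a good dog"

# Auditable: the score is just the length of this list.
print(matched_words("fast dog cat", chunk))  # ['dog', 'fast']

# Blind spot: "canine" means dog, but the strings do not overlap.
print(matched_words("canine", chunk))        # []
```

The empty list for "canine" is the whole limitation in one line: the retriever has nothing to point at, so the chunk scores zero.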

Below you have four chunks and the query "fast dog cat". Finish score() to count how many distinct query words appear in a chunk, rank all four (highest first, ties keep document order), and print the top two as "score N: <chunk>". Done means the score-3 and score-2 chunks print, in that order, and the zero-overlap chunks are gone.

Lexical retrieval is blunt but fast and transparent: every result is one you can explain by pointing at the words that matched.
