Semantic Similarity

Meaning as vectors

The keyword retriever you just built has one fatal blind spot: it matches the words you typed, nothing more. Ask it about a "canine" and the chunk about "the dog" scores zero — same idea, zero shared strings. Real users paraphrase constantly, so a string-matcher misses the right chunk and the agent answers from a worse one (or refuses when it shouldn't). Semantic search closes that gap by matching meaning.

Meaning lives in an embedding: a model converts a chunk into a vector, a direction in space, where things that mean the same thing point nearly the same way, regardless of which words they use. "The dog ran" and "the canine sprinted" land close together; "fish swim deep" points elsewhere. To rank, you measure the cosine similarity between the query's vector and each chunk's vector: the cosine of the angle between them, which is 1 when they point identically, shrinks toward 0 as they approach perpendicular, and goes negative only when they point in opposing directions. Crucially, cosine looks at direction, not length: a long chunk and a short one about the same topic still score high, because dividing by each vector's magnitude normalizes the size away.
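To make the "direction, not length" claim concrete, here is a minimal sketch of cosine similarity in plain Python. The variable names (`longer`, `same_direction`, `scaled`, `opposite`) are illustrative and not part of the lesson's starter code; the vectors are the query and c1 from the worked example.

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

query = [0.8, 0.3, 0.0]
chunk = [0.9, 0.1, 0.0]
longer = [10 * x for x in chunk]  # same direction, ten times the length

same_direction = cosine(query, chunk)
scaled = cosine(query, longer)                  # magnitude divides out
opposite = cosine(query, [-x for x in query])   # reversed direction

print(round(same_direction, 2))  # 0.97
print(round(scaled, 2))          # 0.97
print(round(opposite, 2))        # -1.0
```

Scaling the chunk vector leaves the score unchanged, which is why vector length alone never moves a chunk up or down the ranking.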

Work the example. The query points roughly along [0.8, 0.3, 0.0]. Chunk c1 at [0.9, 0.1, 0.0] points almost the same way → cosine ≈ 0.97, the top hit. c4 at [0.6, 0.5, 0.1] is close but tilts → ≈ 0.94. But c2 at [0.2, 0.9, 0.1] leans into the second axis, roughly 57° away from the query → ≈ 0.54, a much lower score even though it sits second in storage order. Ranking by similarity, not by where a chunk happened to be stored, is the entire point.
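For readers who want to check the numbers by hand, the arithmetic for c1 and c2 can be written out as follows. The symbols q, c_1, and c_2 are shorthand introduced here for the query and chunk vectors.

```latex
\[
\cos(q, c_1)
  = \frac{q \cdot c_1}{\lVert q \rVert \, \lVert c_1 \rVert}
  = \frac{(0.8)(0.9) + (0.3)(0.1) + (0.0)(0.0)}
         {\sqrt{0.8^2 + 0.3^2 + 0.0^2}\;\sqrt{0.9^2 + 0.1^2 + 0.0^2}}
  = \frac{0.75}{\sqrt{0.73}\,\sqrt{0.82}}
  \approx 0.97
\]
\[
\cos(q, c_2)
  = \frac{(0.8)(0.2) + (0.3)(0.9) + (0.0)(0.1)}
         {\sqrt{0.73}\;\sqrt{0.2^2 + 0.9^2 + 0.1^2}}
  = \frac{0.43}{\sqrt{0.73}\,\sqrt{0.86}}
  \approx 0.54
\]
```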

Why it matters: this is what production vector databases do at scale — embed every chunk once, then for each query embed it and find the nearest directions. It catches the paraphrases keyword search drops, at the cost of a model call to build the vectors. No API here — the vectors are precomputed so you can focus on the ranking.

Fill in cosine(a, b) as dot(a, b) / (magnitude(a) * magnitude(b)), score every chunk against the query, sort highest-first, and print the top 2 as <id>: <score> rounded to two decimals. Done means c1: 0.97 then c4: 0.94 — and c2 nowhere in sight.
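One possible shape for the finished exercise is sketched below. It is not the lesson's reference solution: the `chunks` dictionary layout and the `ranked` and `lines` names are assumptions, and only the three chunks whose vectors appear in the text are included, in storage order.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (magnitude(a) * magnitude(b))

query = [0.8, 0.3, 0.0]

# Assumed layout: chunk id -> precomputed embedding, in storage order.
chunks = {
    "c1": [0.9, 0.1, 0.0],
    "c2": [0.2, 0.9, 0.1],
    "c4": [0.6, 0.5, 0.1],
}

# Score every chunk, then sort by score (not storage order), highest first.
ranked = sorted(
    ((chunk_id, cosine(query, vec)) for chunk_id, vec in chunks.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

lines = [f"{chunk_id}: {score:.2f}" for chunk_id, score in ranked[:2]]
for line in lines:
    print(line)
# c1: 0.97
# c4: 0.94
```

Sorting on the score rather than iterating the dictionary directly is what pushes c2 out of the top 2 despite its earlier storage position.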

A retriever doesn't fetch the chunk you stored first — it fetches the chunk whose meaning points most nearly toward your question.

This is 1 of 50 lessons.