
This is Part 3 of a 4-part series on embeddings. Part 1: Embeddings: Turning Meaning Into Geometry | Part 2: How Embeddings Actually Work in Practice | Part 4: From Retrieved Context to a Grounded Answer
By the end of Part 2, we had fixed the obvious problems.
And yet, systems still fail.
Results look relevant, but answers are incomplete. Good documents exist, but the model ignores them. Top results repeat themselves.
This is not an embedding problem.
This is a retrieval dynamics problem.
Most systems do the same thing: embed the query, pull the top-k chunks by cosine similarity, and paste them straight into the prompt.
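A minimal sketch of that loop in Python; the function name and the use of plain cosine similarity over normalized vectors are illustrative assumptions, not a specific library's API:

```python
import numpy as np

def top_k_by_similarity(query_vec, chunk_vecs, k=5):
    """Rank chunks purely by cosine similarity to the query."""
    # Normalize so that dot products equal cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    # Indices of the k most similar chunks, best first.
    return np.argsort(-scores)[:k]
```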
This looks reasonable.
It is also one of the most common sources of failure. Why?
Because similarity ranking is not the same as usefulness ranking.
Embeddings optimize for semantic similarity.
Users care about task completion.
These are not the same thing.
A user asks: "How do I file a claim?"
The system returns its top chunks by similarity.
From an embedding perspective, this is a success. From a user perspective, it is a failure.
All retrieved chunks are similar to the query. Only one is useful.
Top-k similarity ranking cannot distinguish this on its own.
A very common symptom:
Your top 5 results all say almost the same thing.
This happens because embedding distance does not penalize redundancy: five near-identical chunks can each score just as well against the query as any one of them.
So you get five variations of the same passage, and the LLM struggles to synthesize an answer because there is no new signal.
Another hidden assumption in top-k retrieval is this: either something is relevant or it is not.
In reality, relevance exists on multiple axes: topical similarity, specificity, recency, and whether the chunk applies to the user's situation at all.
Embedding similarity mostly captures the first.
If you do not explicitly model the others, retrieval feels random in edge cases.
A good retrieval set is not just similar.
It is diverse along the right dimensions.
Instead of asking:
What are the 5 most similar chunks?
You should ask:
What are the 5 most useful, distinct chunks?
This is where diversity-aware retrieval comes in.
MMR (Maximal Marginal Relevance) is a simple but powerful idea.
It balances two forces: a chunk's relevance to the query and its similarity to the chunks already selected.
In practice, it works greedily: pick the most relevant chunk first, then repeatedly add the candidate that is still relevant but least redundant with what is already in the set.
The result is a retrieval set that covers distinct aspects of the question instead of repeating one.
MMR often improves answer quality without changing embeddings or chunking.
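A minimal MMR sketch over L2-normalized embeddings; the parameter names and the default lambda are illustrative choices, not canonical values:

```python
import numpy as np

def mmr(query_vec, chunk_vecs, k=5, lambda_=0.7):
    """Greedy Maximal Marginal Relevance selection.

    Assumes query_vec and the rows of chunk_vecs are L2-normalized,
    so dot products are cosine similarities.
    """
    relevance = chunk_vecs @ query_vec           # similarity of each chunk to the query
    selected = [int(np.argmax(relevance))]       # start with the single most relevant chunk
    candidates = set(range(len(chunk_vecs))) - set(selected)

    while len(selected) < k and candidates:
        best_idx, best_score = None, -np.inf
        for i in candidates:
            # How redundant is this chunk with what we already picked?
            redundancy = max(float(chunk_vecs[i] @ chunk_vecs[j]) for j in selected)
            score = lambda_ * relevance[i] - (1 - lambda_) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected
```

With lambda_ close to 1 this degenerates into plain top-k; lower values trade raw relevance for coverage.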
Embedding similarity is cheap and approximate.
Sometimes you need a second pass.
Rerankers, typically cross-encoders, do exactly that: they read the query and each candidate chunk together, score how well the chunk actually answers the question, and reorder the candidates by that score.
Rerankers are slower than embeddings but far more discriminative.
This makes them ideal for a second-stage pass over a small candidate set, after embeddings have done the cheap, broad first pass.
A common and effective pattern: retrieve a wide candidate set with embeddings (say the top 50 chunks), rerank those candidates, and keep only the best few for the prompt.
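A sketch of that two-stage pattern using the sentence-transformers CrossEncoder class; the specific model name is just a commonly used example and can be swapped for any cross-encoder reranker:

```python
from sentence_transformers import CrossEncoder

# Any cross-encoder reranking model works here; this one is a popular lightweight choice.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidate_chunks, keep=5):
    """Score (query, chunk) pairs jointly, then keep the highest-scoring chunks."""
    pairs = [(query, chunk) for chunk in candidate_chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]
```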
Many systems retrieve only the top 5 or top 10 chunks, which often limits recall unnecessarily.
A better pattern is to retrieve a much larger candidate pool, then filter, deduplicate, and rerank it down to the handful of chunks that actually reach the LLM.
This gives your system room to correct early ranking mistakes.
A common mistake is treating metadata filters as optional.
Metadata is how you encode non-semantic relevance.
Examples: document recency, product version, the user's plan or region, access permissions, source type.
Embedding similarity cannot infer these reliably.
If your retrieval ignores metadata, you are asking embeddings to do a job they were not designed for.
Good systems combine hard metadata filters with semantic similarity ranking, letting each handle the kind of relevance it is built for.
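A minimal sketch of that combination, assuming each chunk is a dict carrying its text, a normalized embedding, and a metadata dict; the schema and field names are assumptions for illustration:

```python
import numpy as np

def filtered_search(query_vec, chunks, required_meta, k=5):
    """Hard metadata filters first, semantic ranking on whatever survives."""
    survivors = [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in required_meta.items())
    ]
    if not survivors:
        return []
    vecs = np.stack([c["vec"] for c in survivors])
    scores = vecs @ query_vec                 # assumes L2-normalized vectors
    order = np.argsort(-scores)[:k]
    return [survivors[i] for i in order]

# Example: only current-version how-to docs are even considered.
# filtered_search(q_vec, chunks, {"product_version": "v2", "doc_type": "how-to"})
```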
The distinction between similarity and usefulness matters again when you evaluate.
You can have a strong embedding model, clean chunks, and high similarity scores, and still produce bad answers.
Why?
Because retrieval quality must be evaluated in the context of the task.
The right metric is not "are the retrieved chunks similar to the query?"
It is "do the retrieved chunks contain what the model needs to complete the task?"
This is why many teams misdiagnose problems as "LLM hallucinations" when the real issue is retrieval selection.
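One lightweight way to measure this on your own data: label a small set of real queries with the chunk IDs that actually answer them, then check how often at least one of those chunks makes it into what you send to the LLM. The data format and function names below are assumptions for the sketch:

```python
def retrieval_hit_rate(labeled_queries, retrieve, k=5):
    """Fraction of queries where at least one gold chunk appears in the top-k.

    labeled_queries: list of (query, set_of_gold_chunk_ids) pairs.
    retrieve:        callable returning a list of chunk ids for a query.
    """
    hits = 0
    for query, gold_ids in labeled_queries:
        retrieved = set(retrieve(query, k))
        if retrieved & gold_ids:
            hits += 1
    return hits / len(labeled_queries)
```

Even a few dozen labeled queries will often show whether the real problem is that the answer was never retrieved at all.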
The main benchmarks for comparing embedding models are MTEB (the Massive Text Embedding Benchmark) and BEIR, which focuses specifically on retrieval.
Note: Benchmark scores help choose models, but always validate on your own data and queries.
A strong baseline looks like this: retrieve a wide candidate set, apply metadata filters, deduplicate or apply MMR, rerank, and pass only the top few chunks to the LLM.
Each step exists for a reason.
Skipping steps shifts burden downstream and increases failure rates.
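Putting the pieces together, the baseline might look like the sketch below. `embed` and `vector_search` are stand-ins for whatever embedding model and vector store you use, and `mmr` and `rerank` refer to the earlier sketches:

```python
import numpy as np

def retrieve_for_prompt(query, chunks, filters=None, k_candidates=50, k_final=5):
    """Wide first pass, hard filters, diversity, then a reranking pass."""
    query_vec = embed(query)                                       # your embedding model (stand-in)
    candidates = vector_search(query_vec, chunks, k=k_candidates)  # broad recall (stand-in)
    if filters:
        candidates = [c for c in candidates
                      if all(c["meta"].get(key) == value for key, value in filters.items())]
    if not candidates:
        return []
    vecs = np.stack([c["vec"] for c in candidates])
    keep = mmr(query_vec, vecs, k=min(20, len(candidates)))        # drop near-duplicates
    shortlist = [candidates[i] for i in keep]
    return rerank(query, [c["text"] for c in shortlist], keep=k_final)
```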
If retrieval feels wrong, ask: Do the top results mostly repeat each other? Are you passing only the first 5 hits straight to the LLM? Is metadata ignored at ranking time? Was the answer even in the retrieved set?
If the answer to most of these is "yes" or "no idea", retrieval dynamics are your bottleneck.
Embeddings give you recall.
Retrieval dynamics give you precision.
Top-k is not a solution. It is a starting point.
Most real gains come from how you select, not how you embed.
Previous: Part 1: Embeddings: Turning Meaning Into Geometry | Part 2: How Embeddings Actually Work in Practice | Next: Part 4: From Retrieved Context to a Grounded Answer