
This is Part 3 of a 4-part series on embeddings. Part 1: Embeddings: Turning Meaning Into Geometry | Part 2: How Embeddings Actually Work in Practice | Part 4: From Retrieved Context to a Grounded Answer
By the end of Part 2, we had fixed the obvious problems.
And yet, systems still fail.
Results look relevant, but answers are incomplete. Good documents exist, but the model ignores them. Top results repeat themselves.
This is not an embedding problem.
This is a retrieval dynamics problem.
Most systems do the same thing: embed the query, pull the top-k chunks by cosine similarity, and paste them straight into the prompt.
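A minimal sketch of that loop in Python; the function name and the use of plain cosine similarity over normalized vectors are illustrative assumptions, not a specific library's API:

```python
import numpy as np

def top_k_by_similarity(query_vec, chunk_vecs, k=5):
    """Rank chunks purely by cosine similarity to the query."""
    # Normalize so that dot products equal cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    # Indices of the k most similar chunks, best first.
    return np.argsort(-scores)[:k]
```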
This looks reasonable.
It is also one of the most common sources of failure. Why?
Because similarity ranking is not the same as usefulness ranking.
Embeddings optimize for semantic similarity.
Users care about task completion.
These are not the same thing.
A user asks: "How do I file a claim?"
The system returns its top chunks by similarity.
From an embedding perspective, this is a success. From a user perspective, it is a failure.
All retrieved chunks are similar to the query. Only one is useful.
Top-k similarity ranking cannot distinguish this on its own.
A very common symptom:
Your top 5 results all say almost the same thing.
This happens because embedding distance does not penalize redundancy: five near-identical chunks can each score just as well against the query as any one of them.
So you get five variations of the same passage, and the LLM struggles to synthesize an answer because there is no new signal.
Another hidden assumption in top-k retrieval is this: either something is relevant or it is not.
In reality, relevance exists on multiple axes: topical similarity, specificity, recency, and whether the chunk applies to the user's situation at all.
Embedding similarity mostly captures the first.
If you do not explicitly model the others, retrieval feels random in edge cases.
A good retrieval set is not just similar.
It is diverse along the right dimensions.
Instead of asking:
What are the 5 most similar chunks?
You should ask:
What are the 5 most useful, distinct chunks?
This is where diversity-aware retrieval comes in.
MMR (Maximal Marginal Relevance) is a simple but powerful idea.
It balances two forces: a chunk's relevance to the query and its similarity to the chunks already selected.
In practice, it works greedily: pick the most relevant chunk first, then repeatedly add the candidate that is still relevant but least redundant with what is already in the set.
The result is a retrieval set that covers distinct aspects of the question instead of repeating one.
MMR often improves answer quality without changing embeddings or chunking.
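A minimal MMR sketch over L2-normalized embeddings; the parameter names and the default lambda are illustrative choices, not canonical values:

```python
import numpy as np

def mmr(query_vec, chunk_vecs, k=5, lambda_=0.7):
    """Greedy Maximal Marginal Relevance selection.

    Assumes query_vec and the rows of chunk_vecs are L2-normalized,
    so dot products are cosine similarities.
    """
    relevance = chunk_vecs @ query_vec           # similarity of each chunk to the query
    selected = [int(np.argmax(relevance))]       # start with the single most relevant chunk
    candidates = set(range(len(chunk_vecs))) - set(selected)

    while len(selected) < k and candidates:
        best_idx, best_score = None, -np.inf
        for i in candidates:
            # How redundant is this chunk with what we already picked?
            redundancy = max(float(chunk_vecs[i] @ chunk_vecs[j]) for j in selected)
            score = lambda_ * relevance[i] - (1 - lambda_) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected
```

With lambda_ close to 1 this degenerates into plain top-k; lower values trade raw relevance for coverage.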
Embedding similarity is cheap and approximate.
Sometimes you need a second pass.
Rerankers, typically cross-encoders, do exactly that: they read the query and each candidate chunk together, score how well the chunk actually answers the question, and reorder the candidates by that score.
Rerankers are slower than embeddings but far more discriminative.
This makes them ideal for a second-stage pass over a small candidate set, after embeddings have done the cheap, broad first pass.
A common and effective pattern: retrieve a wide candidate set with embeddings (say the top 50 chunks), rerank those candidates, and keep only the best few for the prompt.
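A sketch of that two-stage pattern using the sentence-transformers CrossEncoder class; the specific model name is just a commonly used example and can be swapped for any cross-encoder reranker:

```python
from sentence_transformers import CrossEncoder

# Any cross-encoder reranking model works here; this one is a popular lightweight choice.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidate_chunks, keep=5):
    """Score (query, chunk) pairs jointly, then keep the highest-scoring chunks."""
    pairs = [(query, chunk) for chunk in candidate_chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]
```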
Many systems retrieve only the top 5 or top 10 chunks, which often limits recall unnecessarily.
A better pattern is to retrieve a much larger candidate pool, then filter, deduplicate, and rerank it down to the handful of chunks that actually reach the LLM.
This gives your system room to correct early ranking mistakes.
A common mistake is treating metadata filters as optional.
Metadata is how you encode non-semantic relevance.
Examples: document recency, product version, the user's plan or region, access permissions, source type.
Embedding similarity cannot infer these reliably.
If your retrieval ignores metadata, you are asking embeddings to do a job they were not designed for.
Good systems combine hard metadata filters with semantic similarity ranking, letting each handle the kind of relevance it is built for.
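A minimal sketch of that combination, assuming each chunk is a dict carrying its text, a normalized embedding, and a metadata dict; the schema and field names are assumptions for illustration:

```python
import numpy as np

def filtered_search(query_vec, chunks, required_meta, k=5):
    """Hard metadata filters first, semantic ranking on whatever survives."""
    survivors = [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in required_meta.items())
    ]
    if not survivors:
        return []
    vecs = np.stack([c["vec"] for c in survivors])
    scores = vecs @ query_vec                 # assumes L2-normalized vectors
    order = np.argsort(-scores)[:k]
    return [survivors[i] for i in order]

# Example: only current-version how-to docs are even considered.
# filtered_search(q_vec, chunks, {"product_version": "v2", "doc_type": "how-to"})
```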
The distinction between similarity and usefulness matters again when you evaluate.
You can have a strong embedding model, clean chunks, and high similarity scores, and still produce bad answers.
Why?
Because retrieval quality must be evaluated in the context of the task.
The right metric is not "are the retrieved chunks similar to the query?"
It is "do the retrieved chunks contain what the model needs to complete the task?"
This is why many teams misdiagnose problems as "LLM hallucinations" when the real issue is retrieval selection.
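One lightweight way to measure this on your own data: label a small set of real queries with the chunk IDs that actually answer them, then check how often at least one of those chunks makes it into what you send to the LLM. The data format and function names below are assumptions for the sketch:

```python
def retrieval_hit_rate(labeled_queries, retrieve, k=5):
    """Fraction of queries where at least one gold chunk appears in the top-k.

    labeled_queries: list of (query, set_of_gold_chunk_ids) pairs.
    retrieve:        callable returning a list of chunk ids for a query.
    """
    hits = 0
    for query, gold_ids in labeled_queries:
        retrieved = set(retrieve(query, k))
        if retrieved & gold_ids:
            hits += 1
    return hits / len(labeled_queries)
```

Even a few dozen labeled queries will often show whether the real problem is that the answer was never retrieved at all.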
The main benchmarks for comparing embedding models are MTEB (the Massive Text Embedding Benchmark) and BEIR, which focuses specifically on retrieval.
Note: Benchmark scores help choose models, but always validate on your own data and queries.
A strong baseline looks like this: retrieve a wide candidate set, apply metadata filters, deduplicate or apply MMR, rerank, and pass only the top few chunks to the LLM.
Each step exists for a reason.
Skipping steps shifts burden downstream and increases failure rates.
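Putting the pieces together, the baseline might look like the sketch below. `embed` and `vector_search` are stand-ins for whatever embedding model and vector store you use, and `mmr` and `rerank` refer to the earlier sketches:

```python
import numpy as np

def retrieve_for_prompt(query, chunks, filters=None, k_candidates=50, k_final=5):
    """Wide first pass, hard filters, diversity, then a reranking pass."""
    query_vec = embed(query)                                       # your embedding model (stand-in)
    candidates = vector_search(query_vec, chunks, k=k_candidates)  # broad recall (stand-in)
    if filters:
        candidates = [c for c in candidates
                      if all(c["meta"].get(key) == value for key, value in filters.items())]
    if not candidates:
        return []
    vecs = np.stack([c["vec"] for c in candidates])
    keep = mmr(query_vec, vecs, k=min(20, len(candidates)))        # drop near-duplicates
    shortlist = [candidates[i] for i in keep]
    return rerank(query, [c["text"] for c in shortlist], keep=k_final)
```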
If retrieval feels wrong, ask: Do the top results mostly repeat each other? Are you passing only the first 5 hits straight to the LLM? Is metadata ignored at ranking time? Was the answer even in the retrieved set?
If the answer to most of these is "yes" or "no idea", retrieval dynamics are your bottleneck.
Embeddings give you recall.
Retrieval dynamics give you precision.
Top-k is not a solution. It is a starting point.
Most real gains come from how you select, not how you embed.
Previous: Part 1: Embeddings: Turning Meaning Into Geometry | Part 2: How Embeddings Actually Work in Practice | Next: Part 4: From Retrieved Context to a Grounded Answer