
Part 3: Retrieval Is Not Top-K

Relevance, diversity, reranking, and why most systems return the wrong things for the right reasons

2026-01-04 · 4 min read

This is Part 3 of a 4-part series on embeddings. Part 1: Embeddings: Turning Meaning Into Geometry | Part 2: How Embeddings Actually Work in Practice | Part 4: From Retrieved Context to a Grounded Answer

By Part 2, we fixed the obvious problems:

  • chunks are meaningful
  • dimensions are reasonable
  • embeddings are not doing unnecessary damage

And yet, systems still fail.

Results look relevant, but answers are incomplete. Good documents exist, but the model ignores them. Top results repeat themselves.

This is not an embedding problem.

This is a retrieval dynamics problem.


The core mistake: treating retrieval as "top-k"

Most systems do this:

  1. Embed the query
  2. Retrieve the top-k nearest vectors
  3. Pass them to the LLM
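In code, the naive version is only a few lines. A minimal sketch, assuming you bring your own `embed` function and keep chunk vectors in a NumPy array (a real system would use a vector index, but the logic is the same):

```python
import numpy as np
from typing import Callable

def naive_top_k(query: str, embed: Callable[[str], np.ndarray],
                doc_vectors: np.ndarray, docs: list[str], k: int = 5) -> list[str]:
    """Naive retrieval: cosine similarity against every chunk, keep the k nearest."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity to the query
    best = np.argsort(-scores)[:k]      # indices of the k highest-scoring chunks
    return [docs[i] for i in best]

# The k chunks are then pasted into the LLM prompt as-is.
```

Nothing here is wrong, exactly. It is just incomplete.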

This looks reasonable.

It is also one of the most common sources of failure. Why?

Because similarity ranking is not the same as usefulness ranking.


Similarity vs usefulness

Embeddings optimize for semantic similarity.

Users care about task completion.

These are not the same thing.

Example

User asks:

How do I file a claim?

Top retrieved chunks:

  1. Definition of insurance claims
  2. History of claim processing
  3. Regulatory explanation of claims
  4. One paragraph with actual filing steps

From an embedding perspective, this is a success. From a user perspective, it is a failure.

All retrieved chunks are similar to the query. Only one is useful.

Top-k similarity ranking cannot distinguish this on its own.


Why top results often repeat themselves

A very common symptom:

Your top 5 results all say almost the same thing.

This happens because:

  • documents share boilerplate language
  • chunks overlap
  • policies repeat definitions

Embedding distance does not penalize redundancy.

So you get:

  • five very similar chunks
  • low information diversity
  • wasted context window

The LLM then struggles to synthesize an answer because there is no new signal.
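A quick way to confirm this in your own system is to measure how similar the retrieved chunks are to each other, not just to the query. A rough diagnostic sketch, assuming the top-k chunk vectors are already in a NumPy array; the 0.9 threshold is purely illustrative:

```python
import numpy as np

def redundancy_report(vectors: np.ndarray, threshold: float = 0.9) -> tuple[float, int]:
    """Return mean pairwise cosine similarity and the count of
    near-duplicate pairs among the retrieved chunk vectors."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v.T
    iu = np.triu_indices(len(v), k=1)   # upper triangle, excluding self-pairs
    pairwise = sims[iu]
    return float(pairwise.mean()), int((pairwise > threshold).sum())
```

If the mean pairwise similarity of your top 5 is nearly as high as their similarity to the query, you are paying for five chunks and getting roughly one.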


Relevance is not binary

Another hidden assumption in top-k retrieval is this:

Either something is relevant or it is not.

In reality, relevance exists on multiple axes:

  • topical relevance
  • procedural relevance
  • temporal relevance
  • authority or version relevance

Embedding similarity mostly captures the first.

If you do not explicitly model the others, retrieval feels random in edge cases.


The idea that fixes many systems: diversity matters

A good retrieval set is not just similar.

It is diverse along the right dimensions.

Instead of asking:

What are the 5 most similar chunks?

You should ask:

What are the 5 most useful and distinct chunks?

This is where diversity-aware retrieval comes in.


Maximal Marginal Relevance (MMR)

MMR is a simple but powerful idea.

It balances:

  • similarity to the query
  • dissimilarity to already selected results

In practice, it works like this:

  1. Pick the most similar chunk
  2. For each next pick, penalize chunks that are too similar to what you already selected

The result:

  • fewer duplicates
  • more coverage
  • better signal density

MMR often improves answer quality without changing embeddings or chunking.
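The whole algorithm fits in a short function. A minimal sketch, assuming unit-normalized NumPy vectors for the query and candidates; `lambda_mult` trades query similarity against redundancy (0.5 is a common default, not a rule):

```python
import numpy as np

def mmr(query_vec: np.ndarray, cand_vecs: np.ndarray, k: int = 5,
        lambda_mult: float = 0.5) -> list[int]:
    """Select k candidate indices balancing similarity to the query
    against similarity to already selected candidates (MMR)."""
    query_sim = cand_vecs @ query_vec          # similarity to the query
    cand_sim = cand_vecs @ cand_vecs.T         # similarity between candidates
    selected = [int(np.argmax(query_sim))]     # start with the most similar chunk
    while len(selected) < min(k, len(cand_vecs)):
        remaining = [i for i in range(len(cand_vecs)) if i not in selected]
        # MMR score: reward query similarity, penalize closeness to picks so far
        scores = [
            lambda_mult * query_sim[i]
            - (1 - lambda_mult) * cand_sim[i, selected].max()
            for i in remaining
        ]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```

Many vector stores expose MMR as a built-in search option; the logic is small enough to own if yours does not.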


Why reranking exists

Embedding similarity is cheap and approximate.

Sometimes you need a second pass.

Rerankers do exactly that.

They:

  • take a small candidate set, often top 20 or 50
  • score each candidate more precisely
  • reorder them before final selection

Rerankers are slower than embeddings but far more discriminative.

This makes them ideal for:

  • resolving subtle distinctions
  • preferring procedural answers over definitions
  • choosing the best explanation among similar chunks

A common and effective pattern:

  • embeddings for recall
  • reranking for precision
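A sketch of that split using a cross-encoder from the sentence-transformers library; the model name is one commonly used public checkpoint, so substitute whichever reranker you actually deploy:

```python
from sentence_transformers import CrossEncoder

# Cross-encoders score (query, passage) pairs jointly, which is slower
# but far more discriminative than comparing precomputed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Score each candidate against the query and keep the best top_n."""
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]
```

Run this over a wide candidate set from the embedding step (top 20–50), never over the whole corpus.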

Retrieval depth matters more than you think

Many systems retrieve only top 5 or top 10 chunks.

This often limits recall unnecessarily.

A better pattern is:

  • retrieve a wider candidate set, say 20–50
  • then apply diversity or reranking
  • finally select a smaller set for generation

This gives your system room to correct early ranking mistakes.


Metadata is part of retrieval, not a hack

A common mistake is treating metadata filters as optional.

Metadata is how you encode non-semantic relevance.

Examples:

  • document version
  • product line
  • geography
  • language
  • effective date

Embedding similarity cannot infer these reliably.

If your retrieval ignores metadata, you are asking embeddings to do a job they were not designed for.

Good systems combine:

  • semantic similarity
  • structured filtering
  • soft boosting or penalties
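A hedged sketch of combining the three, assuming each candidate carries a metadata dict alongside its embedding score; the boost and penalty values are illustrative, not tuned:

```python
from datetime import date

def score_candidate(sim: float, meta: dict, query_meta: dict) -> float | None:
    """Combine semantic similarity with hard filters and soft boosts.
    Returns None if the candidate should be excluded outright."""
    # Hard filters: never mix document versions or languages
    if meta.get("version") != query_meta.get("version"):
        return None
    if meta.get("language") != query_meta.get("language"):
        return None

    score = sim
    # Soft boost: prefer the user's product line, but do not exclude others
    if meta.get("product_line") == query_meta.get("product_line"):
        score += 0.05
    # Soft penalty: demote documents past their effective date
    effective = meta.get("effective_until")
    if effective and effective < date.today():
        score -= 0.10
    return score
```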

Retrieval quality ≠ answer quality

This distinction is crucial.

You can have:

  • high Recall@k (retrieved most of the "relevant" chunks)
  • good similarity scores
  • clean logs

And still produce bad answers.

Why?

Because retrieval quality must be evaluated in context of the task.

The right metric is not:

  • did we retrieve something relevant?

It is:

  • did we retrieve what the user needed to succeed?

This is why many teams misdiagnose problems as "LLM hallucinations" when the real issue is retrieval selection.


Embedding benchmarks

The main benchmarks for comparing embedding models:

  • MTEB — Most comprehensive, covers 8 task types (classification, clustering, retrieval, similarity, etc.). Most models report MTEB scores.
  • BEIR — Focused on information retrieval with 18 datasets. More relevant if building search systems.
  • STS — Semantic Textual Similarity benchmarks (STS-B, SICK). Tests core embedding capability.

Note: Benchmark scores help choose models, but always validate on your own data and queries.


A practical retrieval pipeline that works

A strong baseline looks like this:

  1. Rewrite or normalize the query if needed
  2. Retrieve a wide candidate set using embeddings
  3. Apply metadata filters or boosts
  4. Apply diversity control (MMR or similar)
  5. Rerank for precision
  6. Select a small, high-signal context set
  7. Generate the answer

Each step exists for a reason.

Skipping steps shifts burden downstream and increases failure rates.
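Wired together, the baseline fits in one function. This is a skeleton under clearly labeled assumptions: `embed`, `vector_search`, and `rerank` stand in for your own embedding model, vector store, and reranker, and `score_candidate` and `mmr` are the sketches from the sections above:

```python
import numpy as np
from typing import Callable

def retrieve_context(
    query: str,
    embed: Callable,           # str -> np.ndarray              (your embedding model)
    vector_search: Callable,   # (np.ndarray, int) -> candidates (your vector store)
    rerank: Callable,          # (str, list[str]) -> list[str]   (your reranker)
    query_meta: dict,
    wide_k: int = 40,
    final_k: int = 5,
) -> list[str]:
    """Baseline pipeline: wide recall, filter, diversify, rerank, trim.
    vector_search is assumed to yield (chunk, vector, metadata, sim) tuples."""
    query_vec = embed(query)

    # 1-2. Retrieve a wide candidate set with cheap embedding similarity
    candidates = vector_search(query_vec, wide_k)

    # 3. Metadata filters and soft boosts (score_candidate from the sketch above)
    scored = []
    for chunk, vec, meta, sim in candidates:
        s = score_candidate(sim, meta, query_meta)
        if s is not None:
            scored.append((chunk, vec, s))
    scored.sort(key=lambda item: item[2], reverse=True)
    scored = scored[: 3 * final_k]          # the boosted score decides who survives

    # 4. Diversity control (mmr from the sketch above)
    vecs = np.stack([vec for _, vec, _ in scored])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    diverse = [scored[i][0] for i in mmr(q, vecs, k=2 * final_k)]

    # 5-6. Rerank for precision, then keep a small, high-signal context set
    return rerank(query, diverse)[:final_k]
```

Query rewriting (step 1) happens before this function and generation (step 7) after it; retrieval's job ends with a small, high-signal context set.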


A quick diagnostic checklist

If retrieval feels wrong, ask:

  • Are my top results redundant?
  • Am I retrieving definitions instead of procedures?
  • Do I retrieve old versions alongside new ones?
  • Is top-k too small to recover from early mistakes?
  • Do I trust raw similarity without ever reranking?

If the answer to most of these is "yes" or "no idea", retrieval dynamics are your bottleneck.


The mental model to keep

Embeddings give you recall.

Retrieval dynamics give you precision.

Top-k is not a solution. It is a starting point.

Most real gains come from how you select, not how you embed.


Previous: Part 1: Embeddings: Turning Meaning Into Geometry | Part 2: How Embeddings Actually Work in Practice | Next: Part 4: From Retrieved Context to a Grounded Answer
