
Part 2: How Embeddings Actually Work in Practice

Chunking, dimensions, and the knobs that decide whether retrieval works

2026-01-04 · 5 min read

This is Part 2 of a 4-part series on embeddings. Part 1: Embeddings: Turning Meaning Into Geometry | Part 3: Retrieval Is Not Top-K | Part 4: From Retrieved Context to a Grounded Answer

Part 1 explained what embeddings are and why they exist.

This part explains how embeddings are actually used in real systems and why most problems come from configuration, not models.

If embeddings feel unreliable, vague, or inconsistent, the reason is almost always here.


The uncomfortable reality

Most embedding systems do not fail because:

  • the model is weak
  • the vector database is slow
  • the dimension is "too small"

They fail because:

  • meaning was broken before embedding
  • too much meaning was compressed into too little space
  • defaults were copied without understanding

Embeddings are simple. The pipeline around them is not.


Chunking is not preprocessing

It defines what your system can understand

A chunk is not "some text".

A chunk is the smallest unit of meaning your system is allowed to retrieve.

If an idea is split across chunks, your system cannot retrieve it as a whole. No model can fix that later.

This makes chunking a modeling decision, not a formatting step.


The most common chunking strategies (and when they work)

1. Fixed-size token chunking

Split text every N tokens with some overlap.

Typical values:

  • 300 to 600 tokens
  • 10–20% overlap

Why people use it

  • easy
  • predictable cost
  • works on flat prose

Why it fails

  • meaning ignores token boundaries
  • rules get separated from conditions
  • exceptions get detached from statements

This strategy is acceptable as a baseline, but rarely optimal.
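A minimal sketch of the mechanics, using whitespace-split words as a rough stand-in for tokenizer tokens (a real pipeline should count tokens with the tokenizer that matches the embedding model):

    # Fixed-size chunking with overlap. Word splitting approximates tokens here;
    # swap in your embedding model's tokenizer for accurate counts.
    def fixed_size_chunks(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
        tokens = text.split()
        step = chunk_size - overlap              # how far each new chunk starts after the previous one
        chunks = []
        for start in range(0, len(tokens), step):
            window = tokens[start:start + chunk_size]
            if window:
                chunks.append(" ".join(window))
            if start + chunk_size >= len(tokens):
                break                            # the last window already covers the tail
        return chunks

    sample = "Cancellation is allowed within 14 days of purchase. " * 100
    chunks = fixed_size_chunks(sample, chunk_size=400, overlap=50)   # ~12% overlap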


2. Structure-aware chunking

Chunk by:

  • headings
  • paragraphs
  • list blocks
  • tables

This respects how documents already encode meaning.

Example: Instead of embedding isolated sentences, embed the full section under "Cancellation Policy".

This preserves:

  • rules
  • qualifiers
  • exceptions

For policies, manuals, and documentation, this is often the biggest quality jump you can get.
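One way to sketch this, assuming the source documents use markdown-style "#" headings to mark sections (adapt the boundary rule to whatever structure your documents actually carry):

    import re

    # Structure-aware chunking: split on headings so a section like
    # "Cancellation Policy" stays intact with its rules, qualifiers, and exceptions.
    def split_by_headings(markdown_text: str) -> list[str]:
        sections: list[list[str]] = []
        current: list[str] = []
        for line in markdown_text.splitlines():
            if re.match(r"^#{1,6}\s", line) and current:
                sections.append(current)         # a new heading closes the previous section
                current = []
            current.append(line)
        if current:
            sections.append(current)
        return ["\n".join(s).strip() for s in sections if any(l.strip() for l in s)]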


3. Sliding window chunking

Use a moving window across text.

Example:

  • window: 400 tokens
  • step: 200 tokens

Each idea appears in multiple chunks.

What it improves

  • recall
  • boundary safety

What it costs

  • more vectors
  • more redundancy
  • noisier top-k results

Sliding windows are brute-force insurance. Useful when missing information is worse than duplication.


4. Hierarchical (parent–child) chunking

Store:

  • small chunks for retrieval
  • larger parent chunks for context

Retrieve the child, then expand to the parent before generation.

This solves a core tradeoff:

  • small chunks retrieve precisely
  • large chunks contain complete answers

This pattern is extremely effective for long documents and complex questions, at the cost of added system complexity.
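A rough sketch of the retrieval side, with in-memory dicts and a naive keyword scorer standing in for a real vector store and embedding similarity:

    # Parent-child chunking: search over small child chunks, then hand the
    # larger parent section to the generator.
    parents = {
        "policy#cancellation": "Full 'Cancellation' section: rules, qualifiers, exceptions ...",
    }
    children = [
        {"parent_id": "policy#cancellation", "text": "Cancellation is allowed within 14 days."},
        {"parent_id": "policy#cancellation", "text": "Exceptions apply to discounted bookings."},
    ]

    def score(query: str, text: str) -> int:
        # placeholder for vector similarity
        return len(set(query.lower().split()) & set(text.lower().split()))

    def retrieve_with_parent_expansion(query: str, top_k: int = 5) -> list[str]:
        hits = sorted(children, key=lambda c: score(query, c["text"]), reverse=True)[:top_k]
        parent_ids = {h["parent_id"] for h in hits}      # several children often share a parent
        return [parents[pid] for pid in parent_ids]      # expand before generation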


5. Contextual chunking

A chunk often depends on its surrounding context.

Instead of embedding raw text, prepend lightweight context:

  • document title
  • section path
  • version or date

Example:

Document: Health Policy
Section: Cancellation
Text: Cancellation is allowed within 14 days…

This makes chunks self-contained without inflating them.

This is one of the highest leverage improvements in real systems.
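A minimal sketch of building that prefix before embedding (the field names are illustrative):

    # Contextual chunking: prepend lightweight context so the chunk is
    # self-contained when retrieved on its own.
    def with_context(chunk_text: str, doc_title: str, section_path: str, version: str) -> str:
        return (
            f"Document: {doc_title}\n"
            f"Section: {section_path}\n"
            f"Version: {version}\n"
            f"Text: {chunk_text}"
        )

    embed_input = with_context(
        "Cancellation is allowed within 14 days...",
        doc_title="Health Policy",
        section_path="Cancellation",
        version="2025-06",
    )
    # embed_input, not the raw chunk text, is what goes to the embedding model.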


Chunk size is not a magic number

It depends on your questions

Chunk size trades off precision and completeness.

Chunk size  | Behavior
Very small  | Precise, but fragmented
Medium      | Balanced, most common
Large       | Context-rich, but vague

A useful practical mapping:

Query type                   | Chunking bias
"Where is X defined?"        | smaller chunks
"What are the rules for X?"  | section-sized chunks
"How do I do X?"             | parent–child
"Compare A vs B"             | summaries or hierarchy

If your chunk does not contain the full idea, retrieval cannot succeed.


Dimensions: what they actually control

Embedding dimension controls representation capacity, not intelligence.

More dimensions mean:

  • finer distinctions
  • fewer semantic collisions
  • higher cost and slower search

Fewer dimensions mean:

  • stronger compression
  • faster retrieval
  • higher risk of different ideas collapsing together

The dimensions you actually see in practice today

Here is what real systems commonly use.

Dimension | Why it exists
384       | fast, lightweight, edge use
512       | older sentence models
768       | BERT-era standard, still common
1024      | better separation, manageable cost
1536      | very common modern default
3072      | high-fidelity, expensive, niche

768 persists mostly due to history, not because it is optimal.

It works well when:

  • chunks are small
  • ideas are narrow
  • language is not extremely specialized

When 768 stops being enough

You usually see degradation when:

  • chunks are large
  • documents are legally or technically dense
  • many clauses differ only slightly
  • domain language is specialized

Examples:

  • insurance endorsements
  • legal exceptions
  • compliance rules
  • code search across similar functions

In these domains, higher dimensions help by reducing semantic collisions; they do not make the model any smarter.


Chunk size and dimension must scale together

This interaction is widely ignored.

Think of it this way:

Chunk size decides how much meaning you pack in. Dimension decides how finely you can encode it.

A practical guide:

Chunk size       | Dimension range
100–200 tokens   | 384–768
300–600 tokens   | 768–1536
800–1500 tokens  | 1536+ or hierarchy
Full sections    | parent–child

If you pack a lot of meaning into a chunk but keep dimensions low, compression destroys distinctions.


Storage and cost math you should do once

Most embeddings use float32 vectors.

That is 4 bytes per dimension.

Dimension | Size per vector
768       | ~3 KB
1024      | ~4 KB
1536      | ~6 KB
3072      | ~12 KB

At scale, this dominates cost.

1 million chunks at 1536 dimensions:

  • ~6 GB raw vectors
  • plus index overhead
  • plus metadata
  • plus replication

Dimension choice is infrastructure, not theory.
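The math fits in a few lines; the overhead and replication factors below are rough assumptions to replace with your own numbers:

    # Back-of-the-envelope storage for float32 vectors: 4 bytes per dimension.
    def raw_vector_bytes(num_chunks: int, dims: int, bytes_per_dim: int = 4) -> int:
        return num_chunks * dims * bytes_per_dim

    raw = raw_vector_bytes(1_000_000, 1536)            # ~6.1 GB of raw vectors
    total = raw * 1.5 * 2                              # assumed index overhead x replication factor
    print(f"raw: {raw / 1e9:.1f} GB, provisioned: {total / 1e9:.1f} GB")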


Dimension reduction that actually works

Three approaches are common today.

1. Models that support dimension selection

Some modern embedding models allow you to choose output dimensions directly, preserving quality better than post-hoc reduction.

This is the cleanest option.


2. Matryoshka-style embeddings

These are trained so that:

  • the first 256, 512, or 768 dimensions are meaningful
  • additional dimensions add refinement

This lets systems:

  • retrieve cheaply
  • rerank with higher precision if needed

This pattern is increasingly common in large-scale systems.
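With a Matryoshka-trained model, the two-stage pattern can be as simple as truncating and re-normalizing (a sketch; plain truncation of a model not trained this way degrades badly):

    import numpy as np

    # Keep the first k dimensions for a cheap first-pass index; rerank the
    # shortlisted candidates with the full vectors. Truncated vectors must be
    # re-normalized before cosine or dot-product comparison.
    def truncate_and_normalize(vectors: np.ndarray, keep_dims: int) -> np.ndarray:
        truncated = vectors[:, :keep_dims]
        norms = np.linalg.norm(truncated, axis=1, keepdims=True)
        return truncated / np.clip(norms, 1e-12, None)

    full = np.random.rand(10_000, 1536).astype(np.float32)   # stand-in embeddings
    coarse = truncate_and_normalize(full, 512)                # cheap retrieval index
    # rerank the top candidates from the coarse pass with the full 1536-d vectors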


3. PCA and projections

This can work, but quality drops unpredictably.

Use only if you measure carefully.
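If you do take this route, fit the projection on a representative sample and measure retrieval quality before and after; a minimal scikit-learn sketch:

    import numpy as np
    from sklearn.decomposition import PCA

    corpus_vectors = np.random.rand(50_000, 1536).astype(np.float32)  # stand-in data

    pca = PCA(n_components=256)
    reduced = pca.fit_transform(corpus_vectors)          # 1536 -> 256 dimensions
    print(pca.explained_variance_ratio_.sum())           # variance retained, not retrieval quality

    # Queries must pass through the same projection at search time:
    # reduced_query = pca.transform(query_vector.reshape(1, -1))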


Overlap: what it really does

Overlap protects against boundary cuts.

It does not add intelligence.

Overlap | When it makes sense
5–10%   | clean structure
10–20%  | narrative text
>25%    | usually masking bad chunking

If overlap is doing most of the work, your boundaries are wrong.


The part everyone forgets: search index tuning

Most vector search is approximate.

As dimension increases:

  • search becomes harder
  • recall drops if parameters stay fixed

Many teams blame embeddings when the real issue is insufficient search depth.

Always tune recall before switching models.
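As a concrete illustration, with a FAISS IVF index the search-depth knob is nprobe; HNSW indexes expose an analogous ef_search parameter. A sketch with random stand-in vectors:

    import numpy as np
    import faiss

    # Approximate search: recall depends on how many clusters are scanned per query.
    d, n = 768, 20_000
    vectors = np.random.rand(n, d).astype(np.float32)    # stand-in embeddings
    queries = np.random.rand(10, d).astype(np.float32)

    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, 256)        # 256 clusters
    index.train(vectors)
    index.add(vectors)

    index.nprobe = 2                                     # fast, lower recall
    _, ids_shallow = index.search(queries, 10)

    index.nprobe = 64                                    # slower, higher recall
    _, ids_deep = index.search(queries, 10)
    # Compare both against an exact IndexFlatL2 search before blaming the embedding model.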


A sane default that rarely embarrasses you

If you need a starting point:

  • structure-aware chunking
  • 400–600 tokens per chunk
  • 10% overlap
  • contextual prefix (title + section)
  • 768 or 1024 dimensions
  • parent–child for long answers
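Written down as a configuration sketch (the field names are illustrative, not any particular framework's API):

    # A starting point, to be tuned against real queries.
    DEFAULT_PIPELINE = {
        "chunking": "structure_aware",          # split on headings/sections first
        "chunk_size_tokens": (400, 600),
        "overlap": 0.10,
        "context_prefix": ["title", "section"],
        "embedding_dimensions": 1024,           # or 768 if cost matters more than fidelity
        "long_answers": "parent_child",         # retrieve child, expand to parent
    }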

Then evaluate on real queries.

Not benchmarks. Your users' questions.

(For more on embedding benchmarks and which ones matter, see Part 3.)


The mental model to keep

Embeddings are compression.

Chunking decides what you compress. Dimensions decide how much detail survives.

Most failures are compression mistakes, not model mistakes.


What comes next

Even with perfect chunking and dimensions, retrieval can still feel wrong.

Because relevance is not the same as usefulness.

The next logical step is retrieval dynamics:

  • top-k limits
  • redundancy
  • diversity
  • reranking

That is what Part 3 explores.


Previous: Part 1: Embeddings: Turning Meaning Into Geometry | Next: Part 3: Retrieval Is Not Top-K | Part 4: From Retrieved Context to a Grounded Answer

