
Part 2: How Embeddings Actually Work in Practice

Chunking, dimensions, and the knobs that decide whether retrieval works

2026-01-04 · 5 min read

This is Part 2 of a 4-part series on embeddings. Part 1: Embeddings: Turning Meaning Into Geometry | Part 3: Retrieval Is Not Top-K | Part 4: From Retrieved Context to a Grounded Answer

Part 1 explained what embeddings are and why they exist.

This part explains how embeddings are actually used in real systems and why most problems come from configuration, not models.

If embeddings feel unreliable, vague, or inconsistent, the reason is almost always here.


The uncomfortable reality

Most embedding systems do not fail because:

  • the model is weak
  • the vector database is slow
  • the dimension is "too small"

They fail because:

  • meaning was broken before embedding
  • too much meaning was compressed into too little space
  • defaults were copied without understanding

Embeddings are simple. The pipeline around them is not.


Chunking is not preprocessing

It defines what your system can understand

A chunk is not "some text".

A chunk is the smallest unit of meaning your system is allowed to retrieve.

If an idea is split across chunks, your system cannot retrieve it as a whole. No model can fix that later.

This makes chunking a modeling decision, not a formatting step.


The most common chunking strategies (and when they work)

1. Fixed-size token chunking

Split text every N tokens with some overlap.

Typical values:

  • 300 to 600 tokens
  • 10–20% overlap

Why people use it

  • easy
  • predictable cost
  • works on flat prose

Why it fails

  • meaning ignores token boundaries
  • rules get separated from conditions
  • exceptions get detached from statements

This strategy is acceptable as a baseline, but rarely optimal.
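A minimal sketch of the mechanics, using whitespace-split words as a rough stand-in for tokenizer tokens (a real pipeline should count tokens with the tokenizer that matches the embedding model):

    # Fixed-size chunking with overlap. Word splitting approximates tokens here;
    # swap in your embedding model's tokenizer for accurate counts.
    def fixed_size_chunks(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
        tokens = text.split()
        step = chunk_size - overlap              # how far each new chunk starts after the previous one
        chunks = []
        for start in range(0, len(tokens), step):
            window = tokens[start:start + chunk_size]
            if window:
                chunks.append(" ".join(window))
            if start + chunk_size >= len(tokens):
                break                            # the last window already covers the tail
        return chunks

    sample = "Cancellation is allowed within 14 days of purchase. " * 100
    chunks = fixed_size_chunks(sample, chunk_size=400, overlap=50)   # ~12% overlap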


2. Structure-aware chunking

Chunk by:

  • headings
  • paragraphs
  • list blocks
  • tables

This respects how documents already encode meaning.

Example: Instead of embedding isolated sentences, embed the full section under "Cancellation Policy".

This preserves:

  • rules
  • qualifiers
  • exceptions

For policies, manuals, and documentation, this is often the biggest quality jump you can get.
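One way to sketch this, assuming the source documents use markdown-style "#" headings to mark sections (adapt the boundary rule to whatever structure your documents actually carry):

    import re

    # Structure-aware chunking: split on headings so a section like
    # "Cancellation Policy" stays intact with its rules, qualifiers, and exceptions.
    def split_by_headings(markdown_text: str) -> list[str]:
        sections: list[list[str]] = []
        current: list[str] = []
        for line in markdown_text.splitlines():
            if re.match(r"^#{1,6}\s", line) and current:
                sections.append(current)         # a new heading closes the previous section
                current = []
            current.append(line)
        if current:
            sections.append(current)
        return ["\n".join(s).strip() for s in sections if any(l.strip() for l in s)]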


3. Sliding window chunking

Use a moving window across text.

Example:

  • window: 400 tokens
  • step: 200 tokens

Each idea appears in multiple chunks.

What it improves

  • recall
  • boundary safety

What it costs

  • more vectors
  • more redundancy
  • noisier top-k results

Sliding windows are brute-force insurance. Useful when missing information is worse than duplication.


4. Hierarchical (parent–child) chunking

Store:

  • small chunks for retrieval
  • larger parent chunks for context

Retrieve the child, then expand to the parent before generation.

This solves a core tradeoff:

  • small chunks retrieve precisely
  • large chunks contain complete answers

This pattern is extremely effective for long documents and complex questions, at the cost of added system complexity.
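A rough sketch of the retrieval side, with in-memory dicts and a naive keyword scorer standing in for a real vector store and embedding similarity:

    # Parent-child chunking: search over small child chunks, then hand the
    # larger parent section to the generator.
    parents = {
        "policy#cancellation": "Full 'Cancellation' section: rules, qualifiers, exceptions ...",
    }
    children = [
        {"parent_id": "policy#cancellation", "text": "Cancellation is allowed within 14 days."},
        {"parent_id": "policy#cancellation", "text": "Exceptions apply to discounted bookings."},
    ]

    def score(query: str, text: str) -> int:
        # placeholder for vector similarity
        return len(set(query.lower().split()) & set(text.lower().split()))

    def retrieve_with_parent_expansion(query: str, top_k: int = 5) -> list[str]:
        hits = sorted(children, key=lambda c: score(query, c["text"]), reverse=True)[:top_k]
        parent_ids = {h["parent_id"] for h in hits}      # several children often share a parent
        return [parents[pid] for pid in parent_ids]      # expand before generation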


5. Contextual chunking

A chunk often depends on its surrounding context.

Instead of embedding raw text, prepend lightweight context:

  • document title
  • section path
  • version or date

Example:

Document: Health Policy
Section: Cancellation
Text: Cancellation is allowed within 14 days…

This makes chunks self-contained without inflating them.

This is one of the highest leverage improvements in real systems.
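A minimal sketch of building that prefix before embedding (the field names are illustrative):

    # Contextual chunking: prepend lightweight context so the chunk is
    # self-contained when retrieved on its own.
    def with_context(chunk_text: str, doc_title: str, section_path: str, version: str) -> str:
        return (
            f"Document: {doc_title}\n"
            f"Section: {section_path}\n"
            f"Version: {version}\n"
            f"Text: {chunk_text}"
        )

    embed_input = with_context(
        "Cancellation is allowed within 14 days...",
        doc_title="Health Policy",
        section_path="Cancellation",
        version="2025-06",
    )
    # embed_input, not the raw chunk text, is what goes to the embedding model.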


Chunk size is not a magic number

It depends on your questions

Chunk size trades off precision and completeness.

Chunk size  | Behavior
Very small  | Precise, but fragmented
Medium      | Balanced, most common
Large       | Context-rich, but vague

A useful practical mapping:

Query type                   | Chunking bias
"Where is X defined?"        | smaller chunks
"What are the rules for X?"  | section-sized chunks
"How do I do X?"             | parent–child
"Compare A vs B"             | summaries or hierarchy

If your chunk does not contain the full idea, retrieval cannot succeed.


Dimensions: what they actually control

Embedding dimension controls representation capacity, not intelligence.

More dimensions mean:

  • finer distinctions
  • fewer semantic collisions
  • higher cost and slower search

Fewer dimensions mean:

  • stronger compression
  • faster retrieval
  • higher risk of different ideas collapsing together

The dimensions you actually see in practice today

Here is what real systems commonly use.

Dimension | Why it exists
384       | fast, lightweight, edge use
512       | older sentence models
768       | BERT-era standard, still common
1024      | better separation, manageable cost
1536      | very common modern default
3072      | high-fidelity, expensive, niche

768 persists mostly due to history, not because it is optimal.

It works well when:

  • chunks are small
  • ideas are narrow
  • language is not extremely specialized

When 768 stops being enough

You usually see degradation when:

  • chunks are large
  • documents are legally or technically dense
  • many clauses differ only slightly
  • domain language is specialized

Examples:

  • insurance endorsements
  • legal exceptions
  • compliance rules
  • code search across similar functions

In these domains, higher dimensions help by reducing semantic collisions; they do not make the model any smarter.


Chunk size and dimension must scale together

This interaction is widely ignored.

Think of it this way:

Chunk size decides how much meaning you pack in. Dimension decides how finely you can encode it.

A practical guide:

Chunk size       | Dimension range
100–200 tokens   | 384–768
300–600 tokens   | 768–1536
800–1500 tokens  | 1536+ or hierarchy
Full sections    | parent–child

If you pack a lot of meaning into a chunk but keep dimensions low, compression destroys distinctions.


Storage and cost math you should do once

Most embeddings use float32 vectors.

That is 4 bytes per dimension.

Dimension | Size per vector
768       | ~3 KB
1024      | ~4 KB
1536      | ~6 KB
3072      | ~12 KB

At scale, this dominates cost.

1 million chunks at 1536 dimensions:

  • ~6 GB raw vectors
  • plus index overhead
  • plus metadata
  • plus replication

Dimension choice is infrastructure, not theory.
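The math fits in a few lines; the overhead and replication factors below are rough assumptions to replace with your own numbers:

    # Back-of-the-envelope storage for float32 vectors: 4 bytes per dimension.
    def raw_vector_bytes(num_chunks: int, dims: int, bytes_per_dim: int = 4) -> int:
        return num_chunks * dims * bytes_per_dim

    raw = raw_vector_bytes(1_000_000, 1536)            # ~6.1 GB of raw vectors
    total = raw * 1.5 * 2                              # assumed index overhead x replication factor
    print(f"raw: {raw / 1e9:.1f} GB, provisioned: {total / 1e9:.1f} GB")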


Dimension reduction that actually works

Three approaches are common today.

1. Models that support dimension selection

Some modern embedding models allow you to choose output dimensions directly, preserving quality better than post-hoc reduction.

This is the cleanest option.


2. Matryoshka-style embeddings

These are trained so that:

  • the first 256, 512, or 768 dimensions are meaningful
  • additional dimensions add refinement

This lets systems:

  • retrieve cheaply
  • rerank with higher precision if needed

This pattern is increasingly common in large-scale systems.
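With a Matryoshka-trained model, the two-stage pattern can be as simple as truncating and re-normalizing (a sketch; plain truncation of a model not trained this way degrades badly):

    import numpy as np

    # Keep the first k dimensions for a cheap first-pass index; rerank the
    # shortlisted candidates with the full vectors. Truncated vectors must be
    # re-normalized before cosine or dot-product comparison.
    def truncate_and_normalize(vectors: np.ndarray, keep_dims: int) -> np.ndarray:
        truncated = vectors[:, :keep_dims]
        norms = np.linalg.norm(truncated, axis=1, keepdims=True)
        return truncated / np.clip(norms, 1e-12, None)

    full = np.random.rand(10_000, 1536).astype(np.float32)   # stand-in embeddings
    coarse = truncate_and_normalize(full, 512)                # cheap retrieval index
    # rerank the top candidates from the coarse pass with the full 1536-d vectors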


3. PCA and projections

This can work, but quality drops unpredictably.

Use only if you measure carefully.
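If you do take this route, fit the projection on a representative sample and measure retrieval quality before and after; a minimal scikit-learn sketch:

    import numpy as np
    from sklearn.decomposition import PCA

    corpus_vectors = np.random.rand(50_000, 1536).astype(np.float32)  # stand-in data

    pca = PCA(n_components=256)
    reduced = pca.fit_transform(corpus_vectors)          # 1536 -> 256 dimensions
    print(pca.explained_variance_ratio_.sum())           # variance retained, not retrieval quality

    # Queries must pass through the same projection at search time:
    # reduced_query = pca.transform(query_vector.reshape(1, -1))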


Overlap: what it really does

Overlap protects against boundary cuts.

It does not add intelligence.

Overlap | When it makes sense
5–10%   | clean structure
10–20%  | narrative text
>25%    | usually masking bad chunking

If overlap is doing most of the work, your boundaries are wrong.


The part everyone forgets: search index tuning

Most vector search is approximate.

As dimension increases:

  • search becomes harder
  • recall drops if parameters stay fixed

Many teams blame embeddings when the real issue is insufficient search depth.

Always tune recall before switching models.
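As a concrete illustration, with a FAISS IVF index the search-depth knob is nprobe; HNSW indexes expose an analogous ef_search parameter. A sketch with random stand-in vectors:

    import numpy as np
    import faiss

    # Approximate search: recall depends on how many clusters are scanned per query.
    d, n = 768, 20_000
    vectors = np.random.rand(n, d).astype(np.float32)    # stand-in embeddings
    queries = np.random.rand(10, d).astype(np.float32)

    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, 256)        # 256 clusters
    index.train(vectors)
    index.add(vectors)

    index.nprobe = 2                                     # fast, lower recall
    _, ids_shallow = index.search(queries, 10)

    index.nprobe = 64                                    # slower, higher recall
    _, ids_deep = index.search(queries, 10)
    # Compare both against an exact IndexFlatL2 search before blaming the embedding model.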


A sane default that rarely embarrasses you

If you need a starting point:

  • structure-aware chunking
  • 400–600 tokens per chunk
  • 10% overlap
  • contextual prefix (title + section)
  • 768 or 1024 dimensions
  • parent–child for long answers
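Written down as a configuration sketch (the field names are illustrative, not any particular framework's API):

    # A starting point, to be tuned against real queries.
    DEFAULT_PIPELINE = {
        "chunking": "structure_aware",          # split on headings/sections first
        "chunk_size_tokens": (400, 600),
        "overlap": 0.10,
        "context_prefix": ["title", "section"],
        "embedding_dimensions": 1024,           # or 768 if cost matters more than fidelity
        "long_answers": "parent_child",         # retrieve child, expand to parent
    }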

Then evaluate on real queries.

Not benchmarks. Your users' questions.

(For more on embedding benchmarks and which ones matter, see Part 3.)


The mental model to keep

Embeddings are compression.

Chunking decides what you compress. Dimensions decide how much detail survives.

Most failures are compression mistakes, not model mistakes.


What comes next

Even with perfect chunking and dimensions, retrieval can still feel wrong.

Because relevance is not the same as usefulness.

The next logical step is retrieval dynamics:

  • top-k limits
  • redundancy
  • diversity
  • reranking

That is what Part 3 explores.


Previous: Part 1: Embeddings: Turning Meaning Into Geometry | Next: Part 3: Retrieval Is Not Top-K | Part 4: From Retrieved Context to a Grounded Answer

