
This is Part 2 of a 4-part series on embeddings. Part 1: Embeddings: Turning Meaning Into Geometry | Part 3: Retrieval Is Not Top-K | Part 4: From Retrieved Context to a Grounded Answer
Part 1 explained what embeddings are and why they exist.
This part explains how embeddings are actually used in real systems and why most problems come from configuration, not models.
If embeddings feel unreliable, vague, or inconsistent, the reason is almost always here.
Most embedding systems do not fail because the model is weak.
They fail because the pipeline around the model is misconfigured: how text is chunked, how many dimensions are stored, and how search is tuned.
Embeddings are simple. The pipeline around them is not.
A chunk is not "some text".
A chunk is the smallest unit of meaning your system is allowed to retrieve.
If an idea is split across chunks, your system cannot retrieve it as a whole. No model can fix that later.
This makes chunking a modeling decision, not a formatting step.
Split text every N tokens with some overlap.
Typical values: a few hundred tokens per chunk, with 10–20% overlap.
Why people use it: it is trivial to implement and needs no knowledge of the document's structure.
Why it fails: boundaries fall at arbitrary points, so ideas get split across chunks.
This strategy is acceptable as a baseline, but rarely optimal.
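A minimal sketch of this baseline, assuming tokens are approximated by whitespace-split words (a real pipeline would count tokens with the embedding model's tokenizer); the function name and default values are illustrative, not from the original:

```python
def chunk_fixed(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` tokens, with `overlap` tokens shared
    between consecutive chunks. Whitespace words stand in for real tokens."""
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        if window:
            chunks.append(" ".join(window))
        if start + size >= len(tokens):
            break
    return chunks
```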
Chunk by the structure the document already provides: headings, sections, paragraphs, and list items.
This respects how documents already encode meaning.
Example: Instead of embedding isolated sentences, embed the full section under "Cancellation Policy".
This preserves the whole idea: the rule, its conditions, and its exceptions stay in one retrievable unit.
For policies, manuals, and documentation, this is often the biggest quality jump you can get.
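A sketch of structure-aware chunking for markdown-like sources; the heading regex, the fallback heading name, and the function name are assumptions for illustration, not requirements:

```python
import re

def chunk_by_sections(markdown: str) -> list[dict]:
    """Split a markdown document at its headings so each chunk is one full
    section, keeping the heading text as metadata for later use."""
    sections = []
    heading, lines = "Preamble", []
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.+)", line)
        if match:
            if lines:
                sections.append({"heading": heading, "text": "\n".join(lines).strip()})
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if lines:
        sections.append({"heading": heading, "text": "\n".join(lines).strip()})
    return sections
```

For the example above, everything under "Cancellation Policy" would land in a single chunk tagged with that heading.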
Use a moving window across text.
Example: with a window that advances by half its length, every passage falls into two consecutive chunks, so each idea appears in multiple chunks.
What it improves: recall at chunk boundaries, because no idea is cut in only one place.
What it costs: storage, duplicated retrieval results, and redundant tokens passed to generation.
Sliding windows are brute-force insurance. Useful when missing information is worse than duplication.
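Reusing the `chunk_fixed` sketch from earlier, a sliding window is just a large overlap relative to the window size; the 50% figure and the file name here are illustrative choices, not recommendations from the original:

```python
# Each 300-token window advances by 150 tokens, so every passage
# appears in roughly two consecutive chunks.
document_text = open("policy.txt").read()   # hypothetical source file
windows = chunk_fixed(document_text, size=300, overlap=150)
```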
Store: small child chunks for matching, each linked to the larger parent section it came from.
Retrieve the child, then expand to the parent before generation.
This solves a core tradeoff: small chunks match precisely, while the expanded parent carries enough context to answer from.
This pattern is extremely effective for long documents and complex questions, at the cost of added system complexity.
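A minimal sketch of the pattern with an in-memory mapping; the section names and texts are placeholders, and the vector search over child chunks is assumed to happen elsewhere:

```python
# Children are what you embed and search; parents are what you generate from.
parents = {
    "cancellation": "Full text of the Cancellation Policy section…",
}
children = [
    {"parent_id": "cancellation", "text": "Cancellation is allowed within 14 days…"},
    {"parent_id": "cancellation", "text": "Refunds go back to the original payment method…"},
]

def expand_to_parents(child_hits: list[dict]) -> list[str]:
    """Given child chunks returned by vector search, return each distinct
    parent section once, in hit order, ready to hand to the generator."""
    seen, contexts = set(), []
    for hit in child_hits:
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            contexts.append(parents[pid])
    return contexts
```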
A chunk often depends on its surrounding context.
Instead of embedding raw text, prepend lightweight context:
Example:
Document: Health Policy
Section: Cancellation
Text: Cancellation is allowed within 14 days…
This makes chunks self-contained without inflating them.
This is one of the highest leverage improvements in real systems.
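A sketch of the prefixing step, assuming the document title and section heading are already attached to each chunk; `embed()` stands in for whichever embedding call your stack uses:

```python
def with_context(doc_title: str, section: str, text: str) -> str:
    """Build the string that actually gets embedded: a lightweight header
    followed by the raw chunk text."""
    return f"Document: {doc_title}\nSection: {section}\nText: {text}"

to_embed = with_context("Health Policy", "Cancellation",
                        "Cancellation is allowed within 14 days…")
# vector = embed(to_embed)   # replace with your embedding client
```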
Chunk size trades off precision and completeness.
| Chunk size | Behavior |
|---|---|
| Very small | Precise, but fragmented |
| Medium | Balanced, most common |
| Large | Context-rich, but vague |
A useful practical mapping:
| Query type | Chunking bias |
|---|---|
| "Where is X defined?" | smaller chunks |
| "What are the rules for X?" | section-sized chunks |
| "How do I do X?" | parent–child |
| "Compare A vs B" | summaries or hierarchy |
If your chunk does not contain the full idea, retrieval cannot succeed.
Embedding dimension controls representation capacity, not intelligence.
More dimensions mean: finer separation between similar meanings, at higher storage and compute cost.
Fewer dimensions mean: cheaper, faster vectors, but more semantic collisions between closely related chunks.
Here is what real systems commonly use.
| Dimension | Why it exists |
|---|---|
| 384 | fast, lightweight, edge use |
| 512 | older sentence models |
| 768 | BERT-era standard, still common |
| 1024 | better separation, manageable cost |
| 1536 | very common modern default |
| 3072 | high-fidelity, expensive, niche |
768 persists mostly due to history, not because it is optimal.
It works well when: chunks are short and the meanings in the corpus are clearly distinct.
You usually see degradation when: chunks are long and dense, or the corpus is full of near-duplicate meanings.
Examples: policy variants, contract clauses, or specifications that differ only in a small condition.
Here, higher dimensions reduce semantic collisions, not errors.
This interaction is widely ignored.
Think of it this way:
Chunk size decides how much meaning you pack in. Dimension decides how finely you can encode it.
A practical guide:
| Chunk size | Dimension range |
|---|---|
| 100–200 tokens | 384–768 |
| 300–600 tokens | 768–1536 |
| 800–1500 tokens | 1536+ or hierarchy |
| Full sections | parent–child |
If you pack a lot of meaning into a chunk but keep dimensions low, compression destroys distinctions.
Most embeddings use float32 vectors.
That is 4 bytes per dimension.
| Dimension | Size per vector |
|---|---|
| 768 | ~3 KB |
| 1024 | ~4 KB |
| 1536 | ~6 KB |
| 3072 | ~12 KB |
At scale, this dominates cost.
1 million chunks at 1536 dimensions: roughly 6 GB of raw vectors, before any index overhead.
Dimension choice is infrastructure, not theory.
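The arithmetic behind these numbers is worth sanity-checking for your own corpus; this counts raw float32 vectors only, so real indexes (graphs, metadata, replicas) sit on top of it:

```python
def raw_vector_storage_gb(num_chunks: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal GB, excluding index structures and metadata."""
    return num_chunks * dims * bytes_per_value / 1e9

print(raw_vector_storage_gb(1_000_000, 1536))   # ~6.1 GB
print(raw_vector_storage_gb(1_000_000, 3072))   # ~12.3 GB
```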
Three approaches are common today.
Some modern embedding models allow you to choose output dimensions directly, preserving quality better than post-hoc reduction.
This is the cleanest option.
These are trained so that: the first dimensions of each vector carry most of the meaning, and truncating the rest loses relatively little.
This lets systems: store and search short vectors cheaply while keeping full vectors for re-ranking.
This pattern is increasingly common in large-scale systems.
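Reading this as Matryoshka-style truncatable embeddings, the usage pattern is truncate-and-renormalize; whether your particular model was trained to tolerate this is an assumption you must verify:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` values and re-normalize so cosine similarity still behaves."""
    short = vec[:dims]
    return short / np.linalg.norm(short)

# Store short vectors for cheap first-pass search; keep full vectors for re-ranking.
full = np.random.randn(1536).astype(np.float32)   # stand-in for a real embedding
short = truncate_embedding(full, 256)
```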
Reducing existing vectors after the fact (for example with PCA, or plain truncation of a model not trained for it) can work, but quality drops unpredictably.
Use only if you measure carefully.
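If you do reduce dimensions after the fact, here is a sketch of the "measure carefully" part, assuming scikit-learn and synthetic stand-ins for your real vectors; the only number that matters is how much of the full-dimension top-k survives the reduction:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-ins for real, L2-normalized document and query embeddings.
rng = np.random.default_rng(0)
doc_vecs = rng.standard_normal((10_000, 1536)).astype(np.float32)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vecs = rng.standard_normal((100, 1536)).astype(np.float32)
query_vecs /= np.linalg.norm(query_vecs, axis=1, keepdims=True)

pca = PCA(n_components=256).fit(doc_vecs)
red_docs = pca.transform(doc_vecs)
red_docs /= np.linalg.norm(red_docs, axis=1, keepdims=True)
red_queries = pca.transform(query_vecs)
red_queries /= np.linalg.norm(red_queries, axis=1, keepdims=True)

def topk_overlap(k: int = 10) -> float:
    """Fraction of full-dimension top-k neighbours that survive the reduction."""
    overlaps = []
    for q_full, q_red in zip(query_vecs, red_queries):
        full_top = set(np.argsort(doc_vecs @ q_full)[-k:])
        red_top = set(np.argsort(red_docs @ q_red)[-k:])
        overlaps.append(len(full_top & red_top) / k)
    return float(np.mean(overlaps))

print(topk_overlap())   # a value well below 1.0 means the reduction is hurting retrieval
```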
Overlap protects against boundary cuts.
It does not add intelligence.
| Overlap | When it makes sense |
|---|---|
| 5–10% | clean structure |
| 10–20% | narrative text |
| >25% | usually masking bad chunking |
If overlap is doing most of the work, your boundaries are wrong.
Most vector search is approximate.
As dimension increases: approximate indexes have to examine more candidates per query to maintain the same recall.
Many teams blame embeddings when the real issue is insufficient search depth.
Always tune recall before switching models.
If you need a starting point: increase the index's search depth (ef_search, nprobe, or your engine's equivalent) before touching the model.
Then evaluate on real queries.
Not benchmarks. Your users' questions.
(For more on embedding benchmarks and which ones matter, see Part 3.)
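Here is a sketch of checking search depth before blaming the model, using FAISS HNSW as one concrete engine; other indexes expose the same idea through parameters like nprobe or ef, and the corpus below is synthetic:

```python
import numpy as np
import faiss

dim, n_docs, n_queries, k = 1536, 20_000, 200, 10
rng = np.random.default_rng(0)
docs = rng.standard_normal((n_docs, dim)).astype(np.float32)
queries = rng.standard_normal((n_queries, dim)).astype(np.float32)

# Ground truth from exact brute-force search.
flat = faiss.IndexFlatL2(dim)
flat.add(docs)
_, exact = flat.search(queries, k)

# Approximate HNSW index: recall depends on efSearch, not on the embeddings.
hnsw = faiss.IndexHNSWFlat(dim, 32)
hnsw.add(docs)
for ef in (16, 64, 256):
    hnsw.hnsw.efSearch = ef
    _, approx = hnsw.search(queries, k)
    recall = np.mean([len(set(a) & set(e)) / k for a, e in zip(approx, exact)])
    print(f"efSearch={ef}: recall@{k} ≈ {recall:.2f}")
```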
Embeddings are compression.
Chunking decides what you compress. Dimensions decide how much detail survives.
Most failures are compression mistakes, not model mistakes.
Even with perfect chunking and dimensions, retrieval can still feel wrong.
Because relevance is not the same as usefulness.
The next logical step is retrieval dynamics, and that is what Part 3 explores.
Previous: Part 1: Embeddings: Turning Meaning Into Geometry | Next: Part 3: Retrieval Is Not Top-K → Part 4: From Retrieved Context to a Grounded Answer