
You've probably heard that ChatGPT "predicts the next word." That's... not quite right. It predicts the next token. And if you've ever wondered why your AI can't count letters, struggles with simple math, or occasionally does something completely unhinged - the answer is almost always the same.
Tokenization.
LLMs don't read words. They don't read characters. They read token IDs - integers from a fixed vocabulary.
Everything else - pricing, context limits, latency, weird behaviors, why Korean costs more than English - flows from this single fact.
Neural networks can't operate on strings like "insurance" or "नमस्ते". They need numbers. Tokenization is the bridge: text goes in, a sequence of integer token IDs comes out, and those integers are what the model actually computes with.
The "magic" is in the model's ability to predict what comes next. The interface is tokens.
A token is not a word. A token can be:
"cat""token" + "ization"",", "."" SolidGoldMagikarp" (more on this nightmare later)Most modern LLMs use subword tokenization because it hits the sweet spot:
For English text:
- ~1 token ≈ 4 characters
- ~¾ of a word per token (about 100 tokens ≈ 75 words)
But these are averages. Code, URLs, and non-English text break these estimates badly.
The model's vocabulary contains pieces like:
"play", "ing", "un", "believ", "able""http", "://", ".com"So "tokenization" might become:
["token", "ization"] or["tok", "en", "ization"]And here's the kicker: different models tokenize the same text differently. Their vocabularies and algorithms vary. That's why token counts (and costs) differ across models even for identical text.
Most modern LLMs use Byte Pair Encoding (BPE) or a close variant. The algorithm is beautifully simple:

1. Start with a base vocabulary of individual characters (or bytes).
2. Count every adjacent pair of tokens in the training text.
3. Merge the most frequent pair into a single new token and add it to the vocabulary.
4. Repeat until the vocabulary reaches the target size.
That's it. The result is a vocabulary that efficiently represents common patterns while still being able to handle anything you throw at it.
Starting text: aaabdaaabac
- aa → merge into Z: ZabdZabac
- ab → merge into Y: ZYdZYac
- ZY → merge into X: XdXac

The vocabulary learned: {a, b, c, d, Z, Y, X} where Z=aa, Y=ab, X=ZY
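Here's a toy version of that merge loop in Python. It works on characters and labels new tokens Z, Y, X like the hand trace above; ties between equally frequent pairs are broken arbitrarily, so the intermediate merges may differ from the trace, but the mechanics are the same.

```python
from collections import Counter

def bpe_train(text: str, num_merges: int, symbols: str = "ZYXWV"):
    """Toy BPE: repeatedly merge the most frequent adjacent pair."""
    tokens = list(text)                       # start from individual characters
    merges = {}
    for step in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        new = symbols[step]                   # label for the merged token
        merges[new] = a + b
        # Replace every occurrence of the pair (a, b) with the new token
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(new)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, merges

tokens, merges = bpe_train("aaabdaaabac", num_merges=3)
print("".join(tokens))   # XdXac
print(merges)            # which pair each new symbol stands for
```

Real tokenizers do the same thing over bytes, with integer IDs instead of letters and far more merges.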
GPT-2 has a vocabulary of ~50,000 tokens. GPT-4 has ~100,000. These aren't hand-picked - they're learned from massive text corpora.
The differences matter, but the core idea is the same: learn a vocabulary that compresses common patterns efficiently.
Andrej Karpathy wasn't joking when he said tokenization is the "real root of suffering" in LLMs. Here's the litany:
The model never sees individual letters - it sees tokens. The word "tokenization" is just ["token", "ization"] to the model. It has to learn that these tokens correspond to specific letter sequences, which it only partially does.
Same problem. Reversing "hello" requires knowing the letters, but the model might just see ["hello"] as a single atomic unit.
Numbers are tokenized arbitrarily. "380" might be one token while "381" is two tokens ("38" + "1"). The model has to learn that these wildly different token patterns represent consecutive integers.
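If you want to see how arbitrary the splits are, this sketch prints the token pieces for a run of consecutive integers using the GPT-2 encoding; the specific splits depend on the tokenizer, so treat the output as illustrative.

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's vocabulary

# Consecutive integers can land on very different token patterns
for n in range(378, 384):
    ids = enc.encode(str(n))
    pieces = [enc.decode([i]) for i in ids]
    print(n, "->", len(ids), "token(s):", pieces)
```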
In GPT-2's tokenizer, each space of indentation was a separate token. Four spaces = four tokens. A huge waste of context window for something that carries almost no information. GPT-4's tokenizer groups indentation spaces together.
Tokenizers trained primarily on English are efficient for English. The same meaning in Korean, Japanese, or Hindi often requires 2-4x more tokens. This means higher costs, smaller effective context windows, and higher latency for the same content in those languages.
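A quick way to see the gap is to count tokens for roughly equivalent sentences. A sketch using tiktoken; the Korean line is a rough translation of the English one, and the exact ratio will vary by tokenizer and text.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you feeling today?",
    "Korean": "안녕하세요, 오늘 기분이 어떠세요?",  # roughly the same sentence
}
for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang}: {len(text)} characters -> {n_tokens} tokens")
```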
JSON's curly braces, quotes, and colons eat tokens. YAML's whitespace-based structure is often more token-efficient for the same data.
In early 2023, researchers discovered something bizarre. When asked to repeat the string " SolidGoldMagikarp", ChatGPT would instead say "distribute". Other weird tokens caused the model to insult the user, evade the question, or produce bizarre, unrelated completions.
What happened? These were glitch tokens - strings that made it into the tokenizer's vocabulary during tokenizer training, but were then absent from the data the model itself was trained on.
The tokenizer learned from a corpus that included Reddit usernames like SolidGoldMagikarp (a prolific poster in r/counting). But when the model itself was trained, that content was filtered out. The result: the model has a token for " SolidGoldMagikarp" but has essentially never seen it used in context.
The token's embedding was barely trained. When the model encountered it, the embedding was nearly random - close to the "center" of all token embeddings - causing completely unpredictable behavior.
This has been patched, but it reveals something important: tokenization and model training are separate processes, and misalignment between them can cause weird failures.
When a model says "128K context window," that's 128,000 tokens - not characters, not words. Your budget includes:
Tokenization is the hidden accountant behind "why did it forget the earlier part?"
Most APIs bill per input and output token. Even if your text looks short, tokenization can bloat it: non-English text, URLs, JSON, and whitespace-heavy code all produce more tokens than their character count suggests.
Tiny formatting changes alter tokenization:
- snake_case_identifiers vs camelCase

This can subtly affect both cost and behavior.
"once upon a " (trailing space) tokenizes differently than "once upon a time". That trailing space becomes its own token, affecting the probability distribution of what comes next.
Rules of thumb are fine for estimates, but for billing and limits, use the actual tokenizer:
- The tiktoken library in Python

Your input tokens + output tokens must fit in the context window. If you're near the limit, the model might get cut off mid-response.
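Here's a sketch of that check using tiktoken. The context window size and the output reservation are assumptions for illustration; use your model's real limits.

```python
import tiktoken

CONTEXT_WINDOW = 128_000   # assumed model limit; check your provider's docs
RESERVED_OUTPUT = 4_000    # tokens reserved for the model's response

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt: str) -> bool:
    """True if the prompt plus the reserved output budget fits the window."""
    return len(enc.encode(prompt)) + RESERVED_OUTPUT <= CONTEXT_WINDOW

print(fits_in_context("Summarize the following report: ..."))
```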
Stuff that bloats token count: verbose JSON, long URLs, deeply indented or whitespace-heavy code, and non-English text.
Replacing 50K tokens of raw text with 2K tokens of well-structured summary usually beats brute-forcing more context.
For structured data in prompts, YAML is often more token-efficient:
{"name": "Alice", "age": 30, "city": "NYC"}
vs
name: Alice
age: 30
city: NYC
The YAML version typically uses fewer tokens.
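You can check the claim on your own data by serializing the same record both ways and counting tokens. This sketch assumes PyYAML is installed and uses tiktoken for the counts; the exact savings depend on the structure and the tokenizer.

```python
import json

import tiktoken
import yaml  # PyYAML

enc = tiktoken.get_encoding("cl100k_base")
record = {"name": "Alice", "age": 30, "city": "NYC"}

as_json = json.dumps(record)
as_yaml = yaml.safe_dump(record, sort_keys=False).strip()

for label, text in (("JSON", as_json), ("YAML", as_yaml)):
    print(f"{label} ({len(enc.encode(text))} tokens):\n{text}\n")
```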
Here's how to think about tokenization:
Tokenization is text compression into a shared dictionary of chunks.
The better the vocabulary fits your language and domain, the fewer chunks you need, and the more the model can "see" at once.
A tokenizer trained on English text has learned efficient representations for English patterns. It's less efficient for Korean. A tokenizer trained on code has learned that a run of four spaces is common enough to be a single token.
The vocabulary is a lossy compression layer between human language and model computation. It's not perfect, but it's the best we've got.
If ~1 token ≈ 4 characters:
| Content | Approx Characters | Approx Tokens |
|---|---|---|
| Tweet | 280 | ~70 |
| This article | ~10,000 | ~2,500 |
| Short novel | 100,000 | ~25,000 |
But remember: code, URLs, and non-English can blow these estimates way up.
Tokenization isn't going away, but it might get smarter.
For now, tokenization remains a quirky but essential piece of the LLM puzzle. Understanding it won't make you an AI researcher, but it will help you understand why your AI assistant sometimes does inexplicable things.
The answer is almost always tokenization.
Want to see how your text gets tokenized? Try GPT Tokenizer to visualize the chunks.