
WTF is Tokenization?

Everything you need to know about how LLMs break text into tokens - and why it explains most of their weird behaviors.

2025-12-29 · 6 min read

You've probably heard that ChatGPT "predicts the next word." That's... not quite right. It predicts the next token. And if you've ever wondered why your AI can't count letters, struggles with simple math, or occasionally does something completely unhinged - the answer is almost always the same.

Tokenization.


The One Thing You Need to Know

LLMs don't read words. They don't read characters. They read token IDs - integers from a fixed vocabulary.

Everything else - pricing, context limits, latency, weird behaviors, why Korean costs more than English - flows from this single fact.


Why Tokenization Exists

Neural networks can't operate on strings like "insurance" or "नमस्ते". They need numbers. Tokenization is the bridge:

  1. Text → tokens (split into chunks)
  2. Tokens → token IDs (lookup in a vocabulary)
  3. IDs → vectors (embeddings the model actually processes)
  4. Model predicts the next token ID
  5. Token IDs → tokens → text (decode back to readable output)

The "magic" is in the model's ability to predict what comes next. The interface is tokens.
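
To make the loop concrete, here's a minimal sketch of steps 1, 2, and 5 using OpenAI's open-source tiktoken library (assuming it's installed; other model families ship their own tokenizers):

# Steps 1-2 and 5 of the loop above: text -> token IDs -> text.
# Assumes OpenAI's open-source `tiktoken` package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a GPT-4-era vocabulary

text = "Tokenization is the bridge."
ids = enc.encode(text)                       # text -> token IDs (integers)
chunks = [enc.decode([i]) for i in ids]      # each ID maps back to a chunk of text

print(ids)                                   # a short list of integers
print(chunks)                                # the pieces the model actually "sees"
assert enc.decode(ids) == text               # decoding round-trips the original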


What Exactly Is a Token?

A token is not a word. A token can be:

  • A whole word: "cat"
  • Part of a word: "token" + "ization"
  • Punctuation: ",", "."
  • Whitespace (sometimes its own token)
  • Single characters (especially for rare strings)
  • Or something completely arbitrary like " SolidGoldMagikarp" (more on this nightmare later)

Most modern LLMs use subword tokenization because it hits the sweet spot:

  • Word-level vocabularies explode in size (every new word is unknown)
  • Character-level sequences get too long (slow and expensive)
  • Subwords handle both common words and weird new strings gracefully

The Rules of Thumb

For English text:

  • ~1 token ≈ 4 characters
  • ~100 tokens ≈ 75 words

But these are averages. Code, URLs, and non-English text break these estimates badly.
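
If you just want a ballpark, the heuristic is easy to encode. A rough sketch - an approximation only, never a substitute for the real tokenizer:

# Back-of-envelope estimates from the rules of thumb above - fine for
# ballparks, not for billing (use a real tokenizer for that).
def estimate_tokens_from_chars(text: str) -> int:
    return round(len(text) / 4)          # ~1 token per 4 characters of English

def estimate_tokens_from_words(word_count: int) -> int:
    return round(word_count * 100 / 75)  # ~100 tokens per 75 words

print(estimate_tokens_from_chars("The quick brown fox jumps over the lazy dog."))  # ~11
print(estimate_tokens_from_words(750))                                             # ~1000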


Why Your Text Becomes Weird Chunks

The model's vocabulary contains pieces like:

  • "play", "ing", "un", "believ", "able"
  • "http", "://", ".com"
  • Common byte-ish fragments for edge cases

So "tokenization" might become:

  • ["token", "ization"] or
  • ["tok", "en", "ization"]

And here's the kicker: different models tokenize the same text differently. Their vocabularies and algorithms vary. That's why token counts (and costs) differ across models even for identical text.
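
You can see this directly; a small sketch assuming tiktoken, which ships both the GPT-2 and GPT-4-era encodings:

# The same sentence through two different vocabularies - the counts
# and the splits usually differ.
import tiktoken

text = "Tokenization explains most of the weird behaviors of LLMs."
for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    print(f"{name}: {len(ids)} tokens -> {[enc.decode([i]) for i in ids]}")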


How Tokenizers Are Built: BPE in 60 Seconds

Most modern LLMs use Byte Pair Encoding (BPE) or a close variant. The algorithm is beautifully simple:

  1. Start with individual characters (or bytes) as your vocabulary
  2. Count every adjacent pair in your training data
  3. Merge the most frequent pair into a new token
  4. Repeat until you hit your target vocabulary size

That's it. The result is a vocabulary that efficiently represents common patterns while still being able to handle anything you throw at it.

A Tiny Example

Starting text: aaabdaaabac

  1. Most frequent pair is aa → merge into Z
  2. Now you have: ZabdZabac
  3. Most frequent pair is ab → merge into Y
  4. Now you have: ZYdZYac
  5. Most frequent pair is ZY → merge into X
  6. Final: XdXac

The vocabulary learned: {a, b, c, d, Z, Y, X} where Z=aa, Y=ab, X=ZY
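
Here's a toy version of that training loop in Python - a sketch for intuition, since real tokenizers operate on bytes, pre-split text, and far larger corpora:

# A toy BPE trainer on the example above. Real tokenizers work on bytes
# and huge corpora, but the merge loop is the same idea.
from collections import Counter

def train_bpe(text, num_merges):
    seq = list(text)                       # start from individual characters
    merges = {}
    new_symbols = iter("ZYXWV")            # toy names for the merged tokens
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:])) # count every adjacent pair
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        sym = next(new_symbols)
        merges[sym] = a + b
        out, i = [], 0                     # replace the pair everywhere it occurs
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, merges

print(train_bpe("aaabdaaabac", 3))   # the string compresses to X d X a c

One caveat: when two pairs tie in frequency, the choice between them is arbitrary, so the intermediate merges may not match the walk-through exactly - but the text still compresses down to XdXac.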

GPT-2 has a vocabulary of ~50,000 tokens. GPT-4 has ~100,000. These aren't hand-picked - they're learned from massive text corpora.

The Other Algorithms

  • WordPiece: Similar to BPE, used in BERT-style models
  • Unigram/SentencePiece: Probabilistic approach, doesn't require pre-splitting by spaces (great for languages without spaces)

The differences matter, but the core idea is the same: learn a vocabulary that compresses common patterns efficiently.


The Root of All Suffering

Andrej Karpathy wasn't joking when he said tokenization is the "real root of suffering" in LLMs. Here's the litany:

Why can't LLMs spell words?

The model never sees individual letters - it sees tokens. The word "tokenization" is just ["token", "ization"] to the model. It has to learn that these tokens correspond to specific letter sequences, which it only partially does.

Why can't LLMs reverse strings?

Same problem. Reversing "hello" requires knowing the letters, but the model might just see ["hello"] as a single atomic unit.

Why are LLMs bad at arithmetic?

Numbers are tokenized arbitrarily. "380" might be one token while "381" is two tokens ("38" + "1"). The model has to learn that these wildly different token patterns represent consecutive integers.
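
You can inspect this yourself. A sketch using tiktoken's GPT-2-era encoding - the exact splits are vocabulary-specific, and newer vocabularies tend to group digits differently:

# Print how consecutive integers split into tokens. Splits are
# vocabulary-specific; treat the output as illustrative only.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for n in range(378, 383):
    ids = enc.encode(str(n))
    print(n, [enc.decode([i]) for i in ids])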

Why did GPT-2 struggle with Python?

Each space of indentation was a separate token. Four spaces = four tokens. Huge waste of context window for something that carries almost no information. GPT-4's tokenizer groups indentation spaces together.

Why are LLMs worse at non-English languages?

Tokenizers trained primarily on English are efficient for English. The same meaning in Korean, Japanese, or Hindi often requires 2-4x more tokens. This means:

  • More expensive (you pay per token)
  • Less fits in context window
  • Worse performance overall

Why prefer YAML over JSON?

JSON's curly braces, quotes, and colons eat tokens. YAML's whitespace-based structure is often more token-efficient for the same data.


The SolidGoldMagikarp Incident

In early 2023, researchers discovered something bizarre. When asked to repeat the string " SolidGoldMagikarp", ChatGPT would instead say "distribute". Other weird tokens caused the model to:

  • Recite religious texts
  • Insult the user
  • Claim to be different entities
  • Simply refuse to acknowledge the token existed

What happened? These were glitch tokens - strings that got into the tokenizer's vocabulary during training, but were then absent from the model's training data.

The tokenizer learned from a corpus that included Reddit usernames like SolidGoldMagikarp (a prolific poster in r/counting). But when the model itself was trained, that content was filtered out. The result: the model has a token for " SolidGoldMagikarp" but has essentially never seen it used in context.

The token's embedding was barely trained. When the model encountered it, the embedding was nearly random - close to the "center" of all token embeddings - causing completely unpredictable behavior.

This has been patched, but it reveals something important: tokenization and model training are separate processes, and misalignment between them can cause weird failures.


Why This Actually Matters to You

1. Context Windows Are Measured in Tokens

When a model says "128K context window," that's 128,000 tokens - not characters, not words. Your budget includes:

  • System prompts
  • Conversation history
  • Tool definitions
  • Your content
  • The model's output

Tokenization is the hidden accountant behind "why did it forget the earlier part?"

2. Cost Is Per Token

Most APIs bill per input and output token. Even if your text looks short, tokenization can bloat it:

  • URLs expand into many tokens
  • Code (especially minified) is token-expensive
  • Non-English text often costs 2-4x more
  • Base64, UUIDs, and hashes are terrible
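
If you want a rough sense of the bill, here's a back-of-envelope sketch - the per-token prices below are placeholders, not any provider's real rates:

# Back-of-envelope API cost estimate. The prices are hypothetical
# placeholders - check your provider's current pricing.
PRICE_IN_PER_M_TOKENS = 3.00     # USD per 1M input tokens (placeholder)
PRICE_OUT_PER_M_TOKENS = 15.00   # USD per 1M output tokens (placeholder)

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_IN_PER_M_TOKENS
            + output_tokens * PRICE_OUT_PER_M_TOKENS) / 1_000_000

print(f"${estimate_cost_usd(50_000, 2_000):.2f}")   # $0.18 at the placeholder rates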

3. Prompt Engineering Interacts with Token Boundaries

Tiny formatting changes alter tokenization:

  • Extra spaces and newlines
  • snake_case_identifiers vs camelCase
  • JSON vs plain English

This can subtly affect both cost and behavior.

4. Trailing Whitespace Is Weird

"once upon a " (trailing space) tokenizes differently than "once upon a time". That trailing space becomes its own token, affecting the probability distribution of what comes next.

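A quick way to see it, again assuming tiktoken (the exact split is tokenizer-specific):

# Compare a prompt with and without a trailing space - the token split changes.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for prompt in ("once upon a", "once upon a "):
    ids = enc.encode(prompt)
    print(repr(prompt), "->", [enc.decode([i]) for i in ids])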

Practical Tips

Use Tokenizer Tools When Precision Matters

Rules of thumb are fine for estimates, but for billing and limits, use the actual tokenizer.
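
For OpenAI-style vocabularies, the open-source tiktoken library gives exact counts; other providers document their own tokenizers (many use SentencePiece). A minimal sketch:

# Exact token counts for billing and limit checks. Assumes `tiktoken`;
# use the tokenizer that matches the model you're actually calling.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

print(count_tokens("Everything you need to know about tokenization."))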

Budget Output Tokens Intentionally

Your input tokens + output tokens must fit in the context window. If you're near the limit, the model might get cut off mid-response.
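
A simple pre-flight check helps here - the window size and output reserve below are placeholder numbers, not a specific model's limits:

# Verify that a prompt leaves room for the reply. The limits below are
# placeholders - substitute your model's real context window.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

CONTEXT_WINDOW = 128_000   # total tokens, input + output (placeholder)
RESERVED_OUTPUT = 2_000    # room you want to keep for the model's reply

def fits_in_context(prompt: str) -> bool:
    return len(enc.encode(prompt)) + RESERVED_OUTPUT <= CONTEXT_WINDOW

print(fits_in_context("Summarize the following report: ..."))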

Watch Out for Token-Expensive Content

Stuff that bloats token count:

  • Minified JSON/JavaScript
  • Long URLs with query parameters
  • Base64-encoded data
  • Log files with UUIDs/hashes
  • Non-Latin scripts

Summarize Structure, Not Just Content

Replacing 50K tokens of raw text with 2K tokens of well-structured summary usually beats brute-forcing more context.

Consider YAML Over JSON

For structured data in prompts, YAML is often more token-efficient:

{"name": "Alice", "age": 30, "city": "NYC"}

vs

name: Alice
age: 30
city: NYC

The YAML version typically uses fewer tokens.
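
You can measure the difference for your own payloads. A sketch assuming tiktoken plus the standard json module and PyYAML - savings vary by payload:

# Compare token counts for the same data as JSON vs YAML.
# Assumes `tiktoken` and `pyyaml` are installed; results vary by payload.
import json
import tiktoken
import yaml

enc = tiktoken.get_encoding("cl100k_base")
data = {"name": "Alice", "age": 30, "city": "NYC"}

as_json = json.dumps(data)
as_yaml = yaml.safe_dump(data)

print("json:", len(enc.encode(as_json)), repr(as_json))
print("yaml:", len(enc.encode(as_yaml)), repr(as_yaml))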


The Mental Model

Here's how to think about tokenization:

Tokenization is text compression into a shared dictionary of chunks.

The better the vocabulary fits your language and domain, the fewer chunks you need, and the more the model can "see" at once.

A tokenizer trained on English text has learned efficient representations for English patterns. It's less efficient for Korean. A tokenizer trained on code has learned that a four-space indent is common enough to be a single token.

The vocabulary is a lossy compression layer between human language and model computation. It's not perfect, but it's the best we've got.


Quick Intuition Check

If ~1 token ≈ 4 characters:

  Content         Approx Characters    Approx Tokens
  Tweet           280                  ~70
  This article    ~10,000              ~2,500
  Short novel     100,000              ~25,000

But remember: code, URLs, and non-English can blow these estimates way up.


The Future of Tokenization

Tokenization isn't going away, but it might get smarter:

  • Byte-level models are being explored - they operate directly on UTF-8 bytes, eliminating the tokenization step entirely. But they're slower and harder to train.
  • Domain-specific tokenizers can be trained for code, medical text, or other specialized content.
  • Multilingual tokenizers are getting better at efficient representation across languages.

For now, tokenization remains a quirky but essential piece of the LLM puzzle. Understanding it won't make you an AI researcher, but it will help you understand why your AI assistant sometimes does inexplicable things.

The answer is almost always tokenization.


TL;DR

  • LLMs don't read words - they read token IDs (integers)
  • Tokens are learned chunks, usually 3-4 characters in English
  • BPE (Byte Pair Encoding) learns common patterns from training data
  • Context windows, pricing, and behavior all depend on tokens
  • Non-English text, code, and URLs are often token-expensive
  • Many LLM weirdnesses trace back to tokenization quirks
  • Use tokenizer tools when precision matters

Want to see how your text gets tokenized? Try GPT Tokenizer to visualize the chunks.

