
WTF is Tokenization?

Everything you need to know about how LLMs break text into tokens - and why it explains most of their weird behaviors.

2025-12-29 · 6 min read

You've probably heard that ChatGPT "predicts the next word." That's... not quite right. It predicts the next token. And if you've ever wondered why your AI can't count letters, struggles with simple math, or occasionally does something completely unhinged - the answer is almost always the same.

Tokenization.


The One Thing You Need to Know

LLMs don't read words. They don't read characters. They read token IDs - integers from a fixed vocabulary.

Everything else - pricing, context limits, latency, weird behaviors, why Korean costs more than English - flows from this single fact.


Why Tokenization Exists

Neural networks can't operate on strings like "insurance" or "नमस्ते". They need numbers. Tokenization is the bridge:

  1. Text → tokens (split into chunks)
  2. Tokens → token IDs (lookup in a vocabulary)
  3. IDs → vectors (embeddings the model actually processes)
  4. Model predicts the next token ID
  5. Token IDs → tokens → text (decode back to readable output)

The "magic" is in the model's ability to predict what comes next. The interface is tokens.
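
To make the loop concrete, here's a minimal sketch of steps 1, 2, and 5 using OpenAI's open-source tiktoken library (assuming it's installed; other model families ship their own tokenizers):

# Steps 1-2 and 5 of the loop above: text -> token IDs -> text.
# Assumes OpenAI's open-source `tiktoken` package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a GPT-4-era vocabulary

text = "Tokenization is the bridge."
ids = enc.encode(text)                       # text -> token IDs (integers)
chunks = [enc.decode([i]) for i in ids]      # each ID maps back to a chunk of text

print(ids)                                   # a short list of integers
print(chunks)                                # the pieces the model actually "sees"
assert enc.decode(ids) == text               # decoding round-trips the original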


What Exactly Is a Token?

A token is not a word. A token can be:

  • A whole word: "cat"
  • Part of a word: "token" + "ization"
  • Punctuation: ",", "."
  • Whitespace (sometimes its own token)
  • Single characters (especially for rare strings)
  • Or something completely arbitrary like " SolidGoldMagikarp" (more on this nightmare later)

Most modern LLMs use subword tokenization because it hits the sweet spot:

  • Word-level vocabularies explode in size (every new word is unknown)
  • Character-level sequences get too long (slow and expensive)
  • Subwords handle both common words and weird new strings gracefully

The Rules of Thumb

For English text:

  • ~1 token ≈ 4 characters
  • ~100 tokens ≈ 75 words

But these are averages. Code, URLs, and non-English text break these estimates badly.
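
If you just want a ballpark, the heuristic is easy to encode. A rough sketch - an approximation only, never a substitute for the real tokenizer:

# Back-of-envelope estimates from the rules of thumb above - fine for
# ballparks, not for billing (use a real tokenizer for that).
def estimate_tokens_from_chars(text: str) -> int:
    return round(len(text) / 4)          # ~1 token per 4 characters of English

def estimate_tokens_from_words(word_count: int) -> int:
    return round(word_count * 100 / 75)  # ~100 tokens per 75 words

print(estimate_tokens_from_chars("The quick brown fox jumps over the lazy dog."))  # ~11
print(estimate_tokens_from_words(750))                                             # ~1000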


Why Your Text Becomes Weird Chunks

The model's vocabulary contains pieces like:

  • "play", "ing", "un", "believ", "able"
  • "http", "://", ".com"
  • Common byte-ish fragments for edge cases

So "tokenization" might become:

  • ["token", "ization"] or
  • ["tok", "en", "ization"]

And here's the kicker: different models tokenize the same text differently. Their vocabularies and algorithms vary. That's why token counts (and costs) differ across models even for identical text.
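
You can see this directly; a small sketch assuming tiktoken, which ships both the GPT-2 and GPT-4-era encodings:

# The same sentence through two different vocabularies - the counts
# and the splits usually differ.
import tiktoken

text = "Tokenization explains most of the weird behaviors of LLMs."
for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    print(f"{name}: {len(ids)} tokens -> {[enc.decode([i]) for i in ids]}")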


How Tokenizers Are Built: BPE in 60 Seconds

Most modern LLMs use Byte Pair Encoding (BPE) or a close variant. The algorithm is beautifully simple:

  1. Start with individual characters (or bytes) as your vocabulary
  2. Count every adjacent pair in your training data
  3. Merge the most frequent pair into a new token
  4. Repeat until you hit your target vocabulary size

That's it. The result is a vocabulary that efficiently represents common patterns while still being able to handle anything you throw at it.

A Tiny Example

Starting text: aaabdaaabac

  1. Most frequent pair is aa → merge into Z
  2. Now you have: ZabdZabac
  3. Most frequent pair is ab → merge into Y
  4. Now you have: ZYdZYac
  5. Most frequent pair is ZY → merge into X
  6. Final: XdXac

The vocabulary learned: {a, b, c, d, Z, Y, X} where Z=aa, Y=ab, X=ZY
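
Here's a toy version of that training loop in Python - a sketch for intuition, since real tokenizers operate on bytes, pre-split text, and far larger corpora:

# A toy BPE trainer on the example above. Real tokenizers work on bytes
# and huge corpora, but the merge loop is the same idea.
from collections import Counter

def train_bpe(text, num_merges):
    seq = list(text)                       # start from individual characters
    merges = {}
    new_symbols = iter("ZYXWV")            # toy names for the merged tokens
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:])) # count every adjacent pair
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        sym = next(new_symbols)
        merges[sym] = a + b
        out, i = [], 0                     # replace the pair everywhere it occurs
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, merges

print(train_bpe("aaabdaaabac", 3))   # the string compresses to X d X a c

One caveat: when two pairs tie in frequency, the choice between them is arbitrary, so the intermediate merges may not match the walk-through exactly - but the text still compresses down to XdXac.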

GPT-2 has a vocabulary of ~50,000 tokens. GPT-4 has ~100,000. These aren't hand-picked - they're learned from massive text corpora.

The Other Algorithms

  • WordPiece: Similar to BPE, used in BERT-style models
  • Unigram/SentencePiece: Probabilistic approach, doesn't require pre-splitting by spaces (great for languages without spaces)

The differences matter, but the core idea is the same: learn a vocabulary that compresses common patterns efficiently.


The Root of All Suffering

Andrej Karpathy wasn't joking when he said tokenization is the "real root of suffering" in LLMs. Here's the litany:

Why can't LLMs spell words?

The model never sees individual letters - it sees tokens. The word "tokenization" is just ["token", "ization"] to the model. It has to learn that these tokens correspond to specific letter sequences, which it only partially does.

Why can't LLMs reverse strings?

Same problem. Reversing "hello" requires knowing the letters, but the model might just see ["hello"] as a single atomic unit.

Why are LLMs bad at arithmetic?

Numbers are tokenized arbitrarily. "380" might be one token while "381" is two tokens ("38" + "1"). The model has to learn that these wildly different token patterns represent consecutive integers.
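
You can inspect this yourself. A sketch using tiktoken's GPT-2-era encoding - the exact splits are vocabulary-specific, and newer vocabularies tend to group digits differently:

# Print how consecutive integers split into tokens. Splits are
# vocabulary-specific; treat the output as illustrative only.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for n in range(378, 383):
    ids = enc.encode(str(n))
    print(n, [enc.decode([i]) for i in ids])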

Why did GPT-2 struggle with Python?

Each space of indentation was a separate token. Four spaces = four tokens. Huge waste of context window for something that carries almost no information. GPT-4's tokenizer groups indentation spaces together.

Why are LLMs worse at non-English languages?

Tokenizers trained primarily on English are efficient for English. The same meaning in Korean, Japanese, or Hindi often requires 2-4x more tokens. This means:

  • More expensive (you pay per token)
  • Less fits in context window
  • Worse performance overall

Why prefer YAML over JSON?

JSON's curly braces, quotes, and colons eat tokens. YAML's whitespace-based structure is often more token-efficient for the same data.


The SolidGoldMagikarp Incident

In early 2023, researchers discovered something bizarre. When asked to repeat the string " SolidGoldMagikarp", ChatGPT would instead say "distribute". Other weird tokens caused the model to:

  • Recite religious texts
  • Insult the user
  • Claim to be different entities
  • Simply refuse to acknowledge the token existed

What happened? These were glitch tokens - strings that got into the tokenizer's vocabulary during training, but were then absent from the model's training data.

The tokenizer learned from a corpus that included Reddit usernames like SolidGoldMagikarp (a prolific poster in r/counting). But when the model itself was trained, that content was filtered out. The result: the model has a token for " SolidGoldMagikarp" but has essentially never seen it used in context.

The token's embedding was barely trained. When the model encountered it, the embedding was nearly random - close to the "center" of all token embeddings - causing completely unpredictable behavior.

This has been patched, but it reveals something important: tokenization and model training are separate processes, and misalignment between them can cause weird failures.


Why This Actually Matters to You

1. Context Windows Are Measured in Tokens

When a model says "128K context window," that's 128,000 tokens - not characters, not words. Your budget includes:

  • System prompts
  • Conversation history
  • Tool definitions
  • Your content
  • The model's output

Tokenization is the hidden accountant behind "why did it forget the earlier part?"

2. Cost Is Per Token

Most APIs bill per input and output token. Even if your text looks short, tokenization can bloat it:

  • URLs expand into many tokens
  • Code (especially minified) is token-expensive
  • Non-English text often costs 2-4x more
  • Base64, UUIDs, and hashes are terrible
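
If you want a rough sense of the bill, here's a back-of-envelope sketch - the per-token prices below are placeholders, not any provider's real rates:

# Back-of-envelope API cost estimate. The prices are hypothetical
# placeholders - check your provider's current pricing.
PRICE_IN_PER_M_TOKENS = 3.00     # USD per 1M input tokens (placeholder)
PRICE_OUT_PER_M_TOKENS = 15.00   # USD per 1M output tokens (placeholder)

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_IN_PER_M_TOKENS
            + output_tokens * PRICE_OUT_PER_M_TOKENS) / 1_000_000

print(f"${estimate_cost_usd(50_000, 2_000):.2f}")   # $0.18 at the placeholder rates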

3. Prompt Engineering Interacts with Token Boundaries

Tiny formatting changes alter tokenization:

  • Extra spaces and newlines
  • snake_case_identifiers vs camelCase
  • JSON vs plain English

This can subtly affect both cost and behavior.

4. Trailing Whitespace Is Weird

"once upon a " (trailing space) tokenizes differently than "once upon a time". That trailing space becomes its own token, affecting the probability distribution of what comes next.

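A quick way to see it, again assuming tiktoken (the exact split is tokenizer-specific):

# Compare a prompt with and without a trailing space - the token split changes.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for prompt in ("once upon a", "once upon a "):
    ids = enc.encode(prompt)
    print(repr(prompt), "->", [enc.decode([i]) for i in ids])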

Practical Tips

Use Tokenizer Tools When Precision Matters

Rules of thumb are fine for estimates, but for billing and limits, use the actual tokenizer.
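
For OpenAI-style vocabularies, the open-source tiktoken library gives exact counts; other providers document their own tokenizers (many use SentencePiece). A minimal sketch:

# Exact token counts for billing and limit checks. Assumes `tiktoken`;
# use the tokenizer that matches the model you're actually calling.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

print(count_tokens("Everything you need to know about tokenization."))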

Budget Output Tokens Intentionally

Your input tokens + output tokens must fit in the context window. If you're near the limit, the model might get cut off mid-response.
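
A simple pre-flight check helps here - the window size and output reserve below are placeholder numbers, not a specific model's limits:

# Verify that a prompt leaves room for the reply. The limits below are
# placeholders - substitute your model's real context window.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

CONTEXT_WINDOW = 128_000   # total tokens, input + output (placeholder)
RESERVED_OUTPUT = 2_000    # room you want to keep for the model's reply

def fits_in_context(prompt: str) -> bool:
    return len(enc.encode(prompt)) + RESERVED_OUTPUT <= CONTEXT_WINDOW

print(fits_in_context("Summarize the following report: ..."))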

Watch Out for Token-Expensive Content

Stuff that bloats token count:

  • Minified JSON/JavaScript
  • Long URLs with query parameters
  • Base64-encoded data
  • Log files with UUIDs/hashes
  • Non-Latin scripts

Summarize Structure, Not Just Content

Replacing 50K tokens of raw text with 2K tokens of well-structured summary usually beats brute-forcing more context.

Consider YAML Over JSON

For structured data in prompts, YAML is often more token-efficient:

{"name": "Alice", "age": 30, "city": "NYC"}

vs

name: Alice
age: 30
city: NYC

The YAML version typically uses fewer tokens.
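
You can measure the difference for your own payloads. A sketch assuming tiktoken plus the standard json module and PyYAML - savings vary by payload:

# Compare token counts for the same data as JSON vs YAML.
# Assumes `tiktoken` and `pyyaml` are installed; results vary by payload.
import json
import tiktoken
import yaml

enc = tiktoken.get_encoding("cl100k_base")
data = {"name": "Alice", "age": 30, "city": "NYC"}

as_json = json.dumps(data)
as_yaml = yaml.safe_dump(data)

print("json:", len(enc.encode(as_json)), repr(as_json))
print("yaml:", len(enc.encode(as_yaml)), repr(as_yaml))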


The Mental Model

Here's how to think about tokenization:

Tokenization is text compression into a shared dictionary of chunks.

The better the vocabulary fits your language and domain, the fewer chunks you need, and the more the model can "see" at once.

A tokenizer trained on English text has learned efficient representations for English patterns. It's less efficient for Korean. A tokenizer trained on code has learned that a four-space indent is common enough to be a single token.

The vocabulary is a lossy compression layer between human language and model computation. It's not perfect, but it's the best we've got.


Quick Intuition Check

If ~1 token ≈ 4 characters:

  Content         Approx Characters    Approx Tokens
  Tweet           280                  ~70
  This article    ~10,000              ~2,500
  Short novel     100,000              ~25,000

But remember: code, URLs, and non-English can blow these estimates way up.


The Future of Tokenization

Tokenization isn't going away, but it might get smarter:

  • Byte-level models are being explored - they operate directly on UTF-8 bytes, eliminating the tokenization step entirely. But they're slower and harder to train.
  • Domain-specific tokenizers can be trained for code, medical text, or other specialized content.
  • Multilingual tokenizers are getting better at efficient representation across languages.

For now, tokenization remains a quirky but essential piece of the LLM puzzle. Understanding it won't make you an AI researcher, but it will help you understand why your AI assistant sometimes does inexplicable things.

The answer is almost always tokenization.


TL;DR

  • LLMs don't read words - they read token IDs (integers)
  • Tokens are learned chunks, usually 3-4 characters in English
  • BPE (Byte Pair Encoding) learns common patterns from training data
  • Context windows, pricing, and behavior all depend on tokens
  • Non-English text, code, and URLs are often token-expensive
  • Many LLM weirdnesses trace back to tokenization quirks
  • Use tokenizer tools when precision matters

Want to see how your text gets tokenized? Try GPT Tokenizer to visualize the chunks.

