Why models get smaller and faster, and where quality actually changes
If you've ever downloaded a "4-bit" model, or seen "INT8 inference," you've seen quantization in the wild. The confusing part is that it often sounds like a niche optimization.
It isn't.
Quantization is one of the main reasons LLMs are deployable at all: it's what turns "research artifact that needs a data-center GPU" into "something you can run locally, or serve cheaply at scale."
This post builds a working mental model you can use to make decisions.
The one-sentence definition
Bit-rate is the numeric precision used to store and compute with a model's numbers.
Quantization is compressing those numbers into fewer bits while trying to preserve behavior.
LLMs are mostly numbers and matrix multiplies
A transformer is a stack of matrix multiplications and nonlinearities.
Those matrices are just weights: plain numbers. If you change how those numbers are represented, you change:
- memory footprint
- bandwidth requirements
- speed
- cost per token
- sometimes: output quality and stability
Bit-rates: what 32/16/8/4 really mean
Think of each parameter as a stored value:
- FP32 (32-bit float): high precision, expensive
- FP16/BF16 (16-bit float): the modern default for training and a lot of inference
- INT8 (8-bit): popular for inference; large wins with small quality loss in many cases
- INT4 (4-bit): aggressive compression; enables local/edge runs, higher risk of edge-case degradation
A good practical overview is in Hugging Face's quantization docs, which frame quantization as using lower-precision representations to reduce memory and compute and make larger models usable. (Hugging Face)
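To make the memory line concrete, here's the back-of-the-envelope math for a hypothetical 7B-parameter model, counting weights only (scales, activations, and the KV cache add more on top):

```python
# Rough weight-only memory footprint at different bit-rates.
PARAMS = 7e9  # assume a 7B-parameter model

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gigabytes = PARAMS * bits / 8 / 1e9
    print(f"{name:>10}: ~{gigabytes:.1f} GB")

# FP32 ~28 GB, FP16/BF16 ~14 GB, INT8 ~7 GB, INT4 ~3.5 GB
```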
What quantization is doing under the hood
Most quantization schemes do some version of this:
- Replace a floating-point weight matrix with low-bit integers
- Store per-group scale factors (and sometimes zero-points)
- Reconstruct approximate floats during compute (or use kernels that compute directly on packed formats)
The "magic" is the calibration and packing: where you place error matters more than whether error exists.
Two big families: weight-only vs weights+activations
This is the split that actually matters.
1) Weight-only quantization (common at 4-bit)
Weights are compressed aggressively, while activations are kept in higher precision.
This is popular because activations are often the harder part to quantize without degrading accuracy.
Methods you'll see a lot:
- GPTQ (post-training, weight-only): quantizes very large models down to 3–4 bits per weight while keeping accuracy close to baseline. (arXiv)
- AWQ (activation-aware, weight-only): uses activation statistics to identify "salient" channels and protect them, achieving strong low-bit results without backprop reconstruction. (arXiv)
Hugging Face explicitly supports both GPTQ and AWQ workflows. (Hugging Face)
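In practice you rarely run these algorithms yourself; you load a checkpoint someone has already quantized. A minimal sketch with transformers, assuming the matching backend (e.g. auto-gptq or autoawq) is installed; the repo id below is a placeholder, not a specific model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model-GPTQ"  # hypothetical pre-quantized repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```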
2) Weights + activations quantization (common at 8-bit for speed)
If you can quantize activations too, you can get better hardware efficiency (more of the compute runs in INT8).
A well-known approach here is SmoothQuant, which enables W8A8 (8-bit weights + 8-bit activations) by smoothing activation outliers via an equivalent transformation. (arXiv)
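The core trick is easy to see in a few lines of NumPy. This is only an illustration of the equivalent transformation (with the paper's alpha-controlled migration strength), not the full method:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)); X[:, 3] *= 50     # activations with one outlier channel
W = rng.normal(size=(8, 16))                   # weights

# Per-channel smoothing factors: shrink the outlier activation channel and
# push that magnitude into the corresponding weight rows.
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s           # activations are now easier to quantize to 8-bit
W_smooth = W * s[:, None]  # weights absorb the scale

# The transformation is mathematically equivalent before quantization:
assert np.allclose(X @ W, X_smooth @ W_smooth)
```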
If you're not writing kernels, you usually encounter quantization via libraries:
- bitsandbytes: used widely for 8-bit/4-bit loading and low-bit ops; designed to reduce memory footprint for limited resources. (Hugging Face)
- QLoRA: a finetuning method that backpropagates into adapters while keeping the base model frozen in 4-bit; it's a key reason "finetune big models on one GPU" became realistic. (arXiv)
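As a minimal sketch of the bitsandbytes path above (the same 4-bit NF4 setup that QLoRA finetuning starts from), assuming transformers, accelerate, and bitsandbytes are installed; the model id is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # the NF4 data type introduced with QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16, # matmuls still run in 16-bit
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-base-model",            # hypothetical repo id
    quantization_config=bnb_config,
    device_map="auto",
)
```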
Where quantization hurts (real failure modes)
Quantization tends to degrade:
- exact arithmetic and tight logical constraints
- long chain-of-thought style reasoning stability
- strict schema/JSON outputs (a single missing character breaks the parse)
- brittle code generation constraints
It's not that low-bit models "can't reason."
It's that the error budget is smaller, and failure modes show up more often on brittle tasks.
Practical rules of thumb
If you're choosing precision for a product:
- Default safe choice: FP16/BF16 or good INT8
- Cost/throughput focus: INT8 or W8A8 approaches (e.g., SmoothQuant-style)
- Local / edge / single GPU: 4-bit (GPTQ/AWQ variants)
- Fine-tuning with limited VRAM: QLoRA-style approaches
And always validate on your prompt distribution. Benchmarks are not your users.
Wrapping up
Quantization isn't just compression. It's a choice about where you spend precision and what kinds of errors you can tolerate.
In Part 2, we'll connect this to a common production mystery: why does the same model behave differently depending on where you call it?
References
- Quantization - Hugging Face
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - arXiv
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration - arXiv
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - arXiv
- Bitsandbytes - Hugging Face
- QLoRA: Efficient Finetuning of Quantized LLMs - arXiv