Why models get smaller and faster, and where quality actually changes
If you've ever downloaded a "4-bit" model, or seen "INT8 inference," you've seen quantization in the wild. The confusing part is that it often sounds like a niche optimization.
It isn't.
Quantization is one of the main reasons LLMs are deployable at all: it's what turns "research artifact that needs a data-center GPU" into "something you can run locally, or serve cheaply at scale."
This post builds a working mental model you can use to make decisions.
The one-sentence definition
Bit-rate is the numeric precision used to store and compute with a model's numbers.
Quantization is compressing those numbers into fewer bits while trying to preserve behavior.
LLMs are mostly numbers and matrix multiplies
A transformer is a stack of matrix multiplications and nonlinearities.
Those matrices are just weights: plain numbers. If you change how those numbers are represented, you change:
- memory footprint
- bandwidth requirements
- speed
- cost per token
- sometimes: output quality and stability
Bit-rates: what 32/16/8/4 really mean
Think of each parameter as a stored value:
- FP32 (32-bit float): high precision, expensive
- FP16/BF16 (16-bit float): the modern default for training and a lot of inference
- INT8 (8-bit): popular for inference; large wins with small quality loss in many cases
- INT4 (4-bit): aggressive compression; enables local/edge runs, higher risk of edge-case degradation
A good practical overview is in Hugging Face's quantization docs, which frame quantization as using lower-precision representations to reduce memory and compute and make larger models usable. (Hugging Face)
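To make the memory line concrete, here's the back-of-the-envelope math for a hypothetical 7B-parameter model, counting weights only (scales, activations, and the KV cache add more on top):

```python
# Rough weight-only memory footprint at different bit-rates.
PARAMS = 7e9  # assume a 7B-parameter model

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gigabytes = PARAMS * bits / 8 / 1e9
    print(f"{name:>10}: ~{gigabytes:.1f} GB")

# FP32 ~28 GB, FP16/BF16 ~14 GB, INT8 ~7 GB, INT4 ~3.5 GB
```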
What quantization is doing under the hood
Most quantization schemes do some version of this:
- Replace a floating-point weight matrix with low-bit integers
- Store per-group scale factors (and sometimes zero-points)
- Reconstruct approximate floats during compute (or use kernels that compute directly on packed formats)
The "magic" is the calibration and packing: where you place error matters more than whether error exists.
Two big families: weight-only vs weights+activations
This is the split that actually matters.
1) Weight-only quantization (common at 4-bit)
Weights are compressed aggressively, while activations are kept in higher precision.
This is popular because activations are often the harder part to quantize without degrading accuracy.
Methods you'll see a lot:
- GPTQ (post-training, weight-only): quantizes very large models down to 3–4 bits per weight while keeping accuracy close to baseline. (arXiv)
- AWQ (activation-aware, weight-only): uses activation statistics to identify "salient" channels and protect them, achieving strong low-bit results without backprop reconstruction. (arXiv)
Hugging Face explicitly supports both GPTQ and AWQ workflows. (Hugging Face)
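In practice you rarely run these algorithms yourself; you load a checkpoint someone has already quantized. A minimal sketch with transformers, assuming the matching backend (e.g. auto-gptq or autoawq) is installed; the repo id below is a placeholder, not a specific model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model-GPTQ"  # hypothetical pre-quantized repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```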
2) Weights + activations quantization (common at 8-bit for speed)
If you can quantize activations too, you can get better hardware efficiency (more of the compute runs in INT8).
A well-known approach here is SmoothQuant, which enables W8A8 (8-bit weights + 8-bit activations) by smoothing activation outliers via an equivalent transformation. (arXiv)
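The core trick is easy to see in a few lines of NumPy. This is only an illustration of the equivalent transformation (with the paper's alpha-controlled migration strength), not the full method:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)); X[:, 3] *= 50     # activations with one outlier channel
W = rng.normal(size=(8, 16))                   # weights

# Per-channel smoothing factors: shrink the outlier activation channel and
# push that magnitude into the corresponding weight rows.
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s           # activations are now easier to quantize to 8-bit
W_smooth = W * s[:, None]  # weights absorb the scale

# The transformation is mathematically equivalent before quantization:
assert np.allclose(X @ W, X_smooth @ W_smooth)
```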
If you're not writing kernels, you usually encounter quantization via libraries:
- bitsandbytes: used widely for 8-bit/4-bit loading and low-bit ops; designed to reduce memory footprint for limited resources. (Hugging Face)
- QLoRA: a finetuning method that backpropagates into adapters while keeping the base model frozen in 4-bit; it's a key reason "finetune big models on one GPU" became realistic. (arXiv)
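As a minimal sketch of the bitsandbytes path above (the same 4-bit NF4 setup that QLoRA finetuning starts from), assuming transformers, accelerate, and bitsandbytes are installed; the model id is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # the NF4 data type introduced with QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16, # matmuls still run in 16-bit
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-base-model",            # hypothetical repo id
    quantization_config=bnb_config,
    device_map="auto",
)
```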
Where quantization hurts (real failure modes)
Quantization tends to degrade:
- exact arithmetic and tight logical constraints
- long chain-of-thought style reasoning stability
- strict schema/JSON outputs (a single missing character breaks the parse)
- brittle code generation constraints
It's not that low-bit models "can't reason."
It's that the error budget is smaller, and failure modes show up more often on brittle tasks.
Practical rules of thumb
If you're choosing precision for a product:
- Default safe choice: FP16/BF16 or good INT8
- Cost/throughput focus: INT8 or W8A8 approaches (e.g., SmoothQuant-style)
- Local / edge / single GPU: 4-bit (GPTQ/AWQ variants)
- Fine-tuning with limited VRAM: QLoRA-style approaches
And always validate on your prompt distribution. Benchmarks are not your users.
Wrapping up
Quantization isn't just compression. It's a choice about where you spend precision and what kinds of errors you can tolerate.
In Part 2, we'll connect this to a common production mystery: why does the same model behave differently depending on where you call it?
References
- Quantization - Hugging Face
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - arXiv
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration - arXiv
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - arXiv
- Bitsandbytes - Hugging Face
- QLoRA: Efficient Finetuning of Quantized LLMs - arXiv