Quantization

Shrinking models by storing weights at lower numerical precision.

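Quantization represents model weights (and sometimes activations) with fewer bits, for example 8-bit or 4-bit instead of 16-bit. Weight memory shrinks roughly in proportion to the bit width (about 2x at 8-bit and 4x at 4-bit, relative to 16-bit), and inference often gets faster because less data has to be moved from memory. Quality loss is usually small, though it tends to grow as the bit width drops.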
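
To make the idea concrete, here is a minimal sketch of one simple scheme: symmetric, per-tensor 8-bit quantization. The function names and sample weights are made up for illustration, and real libraries typically use finer-grained variants (per-channel or per-group scales, 4-bit packing), so read it as an outline of the idea rather than any particular implementation.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    # The scale maps the largest magnitude onto 127; guard against all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for illustration.
weights = [0.02, -0.51, 0.33, 1.27, -0.89]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding to the nearest integer bounds the per-weight error by half a step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as a small integer plus one shared floating-point scale, so the reconstruction error per weight is at most half a quantization step (scale / 2).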

Quantization is what makes it feasible to run capable open models on consumer GPUs and laptops.
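
A back-of-the-envelope calculation shows why. The sketch below estimates weight storage for a hypothetical 7-billion-parameter model at different bit widths. It assumes every weight uses the same width and ignores the overhead of scales, activations, and any inference-time caches, so actual usage will be somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical model size, chosen only for illustration.
NUM_PARAMS = 7e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights take about 14 GB at 16-bit, 7 GB at 8-bit, and 3.5 GB at 4-bit, which is the difference between needing a data-center GPU and fitting on a consumer card.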

Related papers