Quantization
Shrinking models by storing weights at lower numerical precision.
Quantization represents model weights (and sometimes activations) with fewer bits — for example 8-bit or 4-bit instead of 16-bit. This dramatically reduces memory and speeds up inference, usually with minimal quality loss.
It is what makes running capable open models on consumer GPUs and laptops feasible.