
What is model quantization (INT4/INT8/FP8/FP16/FP32)?

Models store their learned knowledge as numbers, billions of them. By default, each number is a 32-bit float (FP32), like a high-resolution photo. Quantization compresses these numbers: FP16 uses half the space, FP8 and INT8 use one quarter (INT8 treats each number as a whole integer from -128 to 127), and INT4 halves that again. The trade-off is like JPEG compression: slightly less precision, but the model still works well enough. We do this because GPUs have limited VRAM, and smaller models also run faster. A 70B-parameter model in FP32 needs ~280GB of memory (70 billion parameters × 4 bytes each), while INT4 quantization brings that down to ~40GB (about half a byte per parameter, plus some overhead), making it runnable on consumer hardware.
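To make the idea concrete, here is a minimal sketch of symmetric INT8 quantization in plain Python: pick a scale so the largest weight maps to 127, round everything to integers, and multiply back by the scale to recover approximate values. The function names and the random stand-in weights are illustrative, not from any particular library.

```python
import random

def quantize_int8(weights):
    """Symmetric INT8 quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floating-point values from the integer codes."""
    return [v * scale for v in q]

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(1000)]  # stand-in for FP32 weights
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)

# The round-trip error is at most scale/2 per weight -- the "JPEG" loss.
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
print(f"max round-trip error: {max_err:.4f}")

# The memory math from above: bytes per parameter times parameter count.
params = 70e9
print(f"FP32: {params * 4 / 1e9:.0f} GB")   # 280 GB
print(f"INT4: {params * 0.5 / 1e9:.0f} GB") # 35 GB raw; ~40 GB with overhead
```

Each INT8 code takes one byte instead of FP32's four, which is where the 4x memory saving comes from; INT4 packs two codes per byte for another 2x.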

Tags:

# machine learning

# optimization