Quantization

A technique that reduces AI model size and memory requirements by using lower-precision numbers to represent model weights, trading a small accuracy loss for major efficiency gains.

How It Works

AI models store their parameters as floating-point numbers, typically in 16-bit (FP16) or 32-bit (FP32) precision. Quantization converts these to lower precision: 8-bit (INT8), 4-bit (INT4), or even 2-bit. A 70B-parameter model at FP16 requires ~140GB of memory; at 4-bit it fits in ~35GB, making it runnable on consumer hardware.

Quantization techniques range from simple round-to-nearest conversion to sophisticated methods like GPTQ and AWQ that calibrate the quantization to minimize accuracy loss. Tools like llama.cpp and the GGUF file format have made quantized models widely accessible: you can run a quantized Llama model on a MacBook with 16GB of RAM.

For builders, quantization matters when self-hosting models or deploying to edge devices. The accuracy tradeoff is usually small: 4-bit quantized models retain roughly 95-99% of the original model's quality on most tasks, with the largest degradation appearing in complex reasoning and nuanced language understanding. Always benchmark a quantized model on your specific task before deploying it.
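The simplest scheme described above, symmetric round-to-nearest quantization with a single per-tensor scale, can be sketched in a few lines. This is a minimal illustration, not how production tools like GPTQ or AWQ work internally; the function names and the toy weight tensor are made up for the example.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Round-to-nearest symmetric quantization: map floats to int8
    so the largest-magnitude weight lands on +/-127."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 values."""
    return q.astype(np.float32) * scale

# Toy weight tensor standing in for one layer of a model.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# FP32 -> INT8 is a 4x memory reduction; rounding error is at most scale/2.
print("FP32 bytes:", w.nbytes, "INT8 bytes:", q.nbytes)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

The same scale-and-round idea underlies 4-bit schemes, except values are packed two per byte and scales are usually stored per small block of weights rather than per tensor, which keeps the rounding error tighter.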

Common Use Cases

  • Running LLMs on consumer hardware
  • Mobile and edge AI deployment
  • Reducing inference costs
  • Fitting larger models in limited GPU memory
