AI Glossary: Infrastructure

Distillation

A technique where a smaller "student" model is trained to replicate the behavior of a larger "teacher" model, achieving comparable quality at lower cost.

How It Works

Knowledge distillation trains a compact model by having it learn from a larger model's outputs rather than from raw training data alone. The teacher model generates responses for a set of inputs, and the student model is trained to produce similar outputs. The student ends up much smaller but retains much of the teacher's capability, which is how many production-ready models are created. GPT-5-mini is likely distilled from larger GPT models, and OpenAI's fine-tuning API effectively lets you distill GPT-5.2's knowledge into a GPT-5-mini-based model for your specific use case, getting close to the large model's quality at a fraction of the inference cost.

For builders, distillation is a cost-optimization strategy:

1. Build your feature with a large, expensive model (GPT-5.2, Claude Opus).
2. Collect the input-output pairs from production usage.
3. Fine-tune a smaller model on those pairs.
4. Replace the large model with the distilled smaller one.

This can reduce inference costs by 5-10x while maintaining 90%+ of the quality.
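At the training-objective level, the classic form of distillation has the student match the teacher's temperature-softened output distribution rather than hard labels. A minimal, dependency-free sketch of that loss (the temperature value and example logits are illustrative, not from any specific model):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # relative preferences between classes ("dark knowledge").
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence between the softened teacher and student distributions.
    # In practice this is minimized by gradient descent, usually mixed with
    # the ordinary cross-entropy loss on ground-truth labels.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]

# A student that matches the teacher exactly incurs zero loss.
assert distillation_loss(teacher, teacher) < 1e-9

# A mismatched student incurs a positive loss to be driven down in training.
assert distillation_loss(teacher, [0.5, 2.0, 1.0]) > 0
```

For API-based distillation of the kind described above, you typically never see logits; fine-tuning on the teacher's text outputs plays the same role.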

Common Use Cases

  • Reducing inference costs in production
  • Creating task-specific compact models
  • Optimizing models for mobile devices
  • Building cheaper alternatives to large models
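The production workflow described earlier (collect input-output pairs, fine-tune a smaller model) starts by exporting logged traffic as training data. A minimal sketch, assuming OpenAI's chat-format JSONL fine-tuning schema; the logged pairs here are hypothetical placeholders:

```python
import json

# Hypothetical logged production traffic: prompts sent to the large
# "teacher" model and the responses it returned.
production_pairs = [
    {"prompt": "Summarize: quarterly revenue rose 12%.", "response": "Revenue grew 12% this quarter."},
    {"prompt": "Classify sentiment: great product!", "response": "positive"},
]

def to_finetune_records(pairs):
    # Convert each logged pair into a chat-style record, one JSON object
    # per line, as fine-tuning APIs such as OpenAI's expect.
    for pair in pairs:
        yield {
            "messages": [
                {"role": "user", "content": pair["prompt"]},
                {"role": "assistant", "content": pair["response"]},
            ]
        }

with open("distillation_train.jsonl", "w") as f:
    for record in to_finetune_records(production_pairs):
        f.write(json.dumps(record) + "\n")
```

The resulting file is then uploaded to fine-tune the smaller student model; filtering out low-quality teacher responses before export is usually worth the effort.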
