
Inference

The process of running a trained AI model to generate predictions or outputs from new inputs, as opposed to training the model.

How It Works

When you call the OpenAI API, you're running inference. The model is already trained; inference just processes your input and generates output. Inference costs depend on model size (GPT-5.2 costs more than GPT-5-mini), the number of input and output tokens, and latency requirements. Self-hosting models (like Llama) gives you control over inference costs but requires GPU infrastructure. Most production apps use API-based inference for simplicity.
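Since API inference is typically billed per token, you can estimate a call's cost from its token counts. A minimal sketch, assuming illustrative per-token prices (the model names and rates below are placeholders, not actual OpenAI pricing):

```python
# Hypothetical prices in USD per 1M tokens -- illustrative only,
# not real pricing for any provider.
PRICES = {
    "large-model": {"input": 10.00, "output": 30.00},
    "small-model": {"input": 0.50, "output": 1.50},
}

def inference_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one inference call from its token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 2,000-token prompt with a 500-token response on the larger model:
print(inference_cost("large-model", 2000, 500))  # → 0.035
```

Note how output tokens dominate the bill at these rates: keeping responses short is often the cheapest optimization.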

Common Use Cases

  • API-based AI features
  • Real-time predictions
  • Batch processing
  • Edge deployment

Need help implementing Inference?

AI 4U Labs builds production AI apps in 2-4 weeks. We use Inference in real products every day.
