AI Glossary: Infrastructure
Inference
The process of running a trained AI model to generate predictions or outputs from new inputs, as opposed to training the model.
How It Works
When you call the OpenAI API, you're running inference. The model is already trained; it's just processing your input and generating output. Inference costs depend on: model size (GPT-5.2 costs more than GPT-5-mini), input/output tokens, and latency requirements. Self-hosting models (like Llama) gives you control over inference costs but requires GPU infrastructure. Most production apps use API-based inference for simplicity.
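Since API-based inference is billed by tokens, per-request cost is easy to estimate. A minimal sketch, assuming hypothetical per-million-token prices (real prices vary by model and provider):

```python
def inference_cost(input_tokens: int, output_tokens: int,
                   in_price_per_m: float, out_price_per_m: float) -> float:
    """Estimate per-request inference cost in dollars.

    Prices are dollars per million tokens; the values used below
    are illustrative, not any provider's actual pricing.
    """
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

# Hypothetical pricing: $2.50/M input tokens, $10.00/M output tokens
cost = inference_cost(1_200, 350, 2.50, 10.00)  # 0.003 + 0.0035 = 0.0065
```

Because output tokens usually cost more than input tokens, trimming verbose responses often saves more than trimming prompts.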
Common Use Cases
- API-based AI features
- Real-time predictions
- Batch processing
- Edge deployment
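Batch processing, for example, just runs inference over many inputs in one pass instead of request by request. A minimal sketch, with a placeholder `run_model` standing in for a real API or local-model call:

```python
def run_model(prompt: str) -> str:
    """Placeholder for real inference (an API call or a local model)."""
    return prompt.upper()  # stand-in transformation for illustration

def batch_infer(prompts: list[str]) -> list[str]:
    """Run the already-trained model over a batch of inputs."""
    return [run_model(p) for p in prompts]

results = batch_infer(["summarize this report", "classify this ticket"])
```

In production, batch jobs often trade latency for cost: providers commonly discount requests that can wait hours rather than milliseconds.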
Related Terms
Large Language Model (LLM)
A neural network trained on massive text datasets that can generate, understand, and reason about human language.
Tokenization
The process of breaking text into smaller units (tokens) that an AI model can process, typically subwords or word pieces.
Fine-Tuning
The process of further training a pre-trained AI model on your specific data to improve performance on domain-specific tasks.