AI Glossary: Infrastructure
Inference
The process of running a trained AI model to generate predictions or outputs from new inputs, as opposed to training the model.
How It Works
When you call the OpenAI API, you're running inference. The model is already trained; it's just processing your input and generating output. Inference costs depend on: model size (GPT-5.2 costs more than GPT-5-mini), input/output tokens, and latency requirements. Self-hosting models (like Llama) gives you control over inference costs but requires GPU infrastructure. Most production apps use API-based inference for simplicity.
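Since API-based inference is billed by tokens, per-request cost is easy to estimate. A minimal sketch, assuming hypothetical per-million-token prices (real prices vary by model and provider):

```python
def inference_cost(input_tokens: int, output_tokens: int,
                   in_price_per_m: float, out_price_per_m: float) -> float:
    """Estimate per-request inference cost in dollars.

    Prices are dollars per million tokens; the values used below
    are illustrative, not any provider's actual pricing.
    """
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

# Hypothetical pricing: $2.50/M input tokens, $10.00/M output tokens
cost = inference_cost(1_200, 350, 2.50, 10.00)  # 0.003 + 0.0035 = 0.0065
```

Because output tokens usually cost more than input tokens, trimming verbose responses often saves more than trimming prompts.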
Common Use Cases
- API-based AI features
- Real-time predictions
- Batch processing
- Edge deployment
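Batch processing, for example, just runs inference over many inputs in one pass instead of request by request. A minimal sketch, with a placeholder `run_model` standing in for a real API or local-model call:

```python
def run_model(prompt: str) -> str:
    """Placeholder for real inference (an API call or a local model)."""
    return prompt.upper()  # stand-in transformation for illustration

def batch_infer(prompts: list[str]) -> list[str]:
    """Run the already-trained model over a batch of inputs."""
    return [run_model(p) for p in prompts]

results = batch_infer(["summarize this report", "classify this ticket"])
```

In production, batch jobs often trade latency for cost: providers commonly discount requests that can wait hours rather than milliseconds.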
Related Terms
Large Language Model (LLM)
A neural network trained on massive text datasets that can generate, understand, and reason about human language.
Tokenization
The process of breaking text into smaller units (tokens) that an AI model can process, typically subwords or word pieces.
Fine-Tuning
The process of further training a pre-trained AI model on your specific data to improve performance on domain-specific tasks.