Model Serving

The infrastructure and process of hosting a trained AI model and exposing it as an API endpoint for real-time or batch inference.

How It Works

Model serving is what happens behind the scenes when you call an AI API: the provider runs your input through the model on GPU hardware and returns the output. When self-hosting open-source models such as Llama, you need to set up your own serving infrastructure.

Popular serving frameworks include vLLM (high-throughput serving with PagedAttention), TGI (Hugging Face's Text Generation Inference), and Triton (NVIDIA's inference server). These handle batching (combining multiple requests for GPU efficiency), memory management, and concurrent request handling. Cloud platforms like AWS SageMaker, Google Vertex AI, and Replicate abstract away serving complexity.

For most builders, model serving is handled by the API provider; you only need to think about it when self-hosting for cost, privacy, or customization reasons. The key metrics are latency (time to first token), throughput (requests per second), and cost per token. Optimizations like quantization and continuous batching significantly improve all three.
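To make the batching idea concrete, here is a toy sketch of a dynamic batcher, not real serving code: requests queue up, and the server drains them in groups of up to `max_batch_size` so each "model call" amortizes GPU cost across many requests. The `run_batch` function is a stand-in for a batched forward pass (a real framework like vLLM does this continuously at the token level); all names here are illustrative.

```python
def run_batch(prompts):
    # Stand-in for one batched GPU forward pass. In a real server this
    # would be a single model invocation over the whole batch, which is
    # far more efficient than one call per request.
    return [f"completion for: {p}" for p in prompts]


class BatchingServer:
    """Toy dynamic batcher: buffer incoming requests, then run up to
    max_batch_size of them through the model in one call."""

    def __init__(self, max_batch_size=8):
        self.max_batch_size = max_batch_size
        self.queue = []

    def submit(self, prompt):
        # In a real server this would be an async HTTP handler.
        self.queue.append(prompt)

    def step(self):
        # Drain up to max_batch_size queued requests into one batch.
        batch = self.queue[: self.max_batch_size]
        self.queue = self.queue[self.max_batch_size :]
        return run_batch(batch)


server = BatchingServer(max_batch_size=8)
for i in range(10):
    server.submit(f"prompt {i}")

first = server.step()   # serves 8 requests in one model call
second = server.step()  # serves the remaining 2
```

Real frameworks refine this with continuous batching, where new requests join a batch between token-generation steps instead of waiting for the whole batch to finish, which is a major driver of the throughput gains mentioned above.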

Common Use Cases

  • Self-hosting open-source models
  • High-throughput inference pipelines
  • Custom model deployment
  • On-premise AI for regulated industries
