AI Glossary: Infrastructure

Latency

The time delay between sending a request to an AI model and receiving the response, critical for real-time user-facing applications.

How It Works

In AI applications, latency has two key measurements: time-to-first-token (TTFT), how long before the first word appears, and total generation time. TTFT matters most for user experience because streaming makes the rest of the response feel fast. Typical TTFT for cloud APIs: GPT-5-mini ~200-400ms, GPT-5.2 ~400-800ms, Claude Opus 4.6 ~500-1000ms.

Factors that increase latency: larger models, longer prompts (more input tokens to process), complex reasoning modes, geographic distance to the API server, and provider load. Factors that decrease it: smaller models, shorter prompts, edge deployment, request caching, and streaming.

For production apps, target <500ms TTFT for conversational features and <2 seconds total for short responses. Strategies to reduce latency:

  • Use the smallest model that meets quality needs
  • Keep prompts concise
  • Enable streaming for all user-facing features
  • Cache common requests
  • Use provider regions closest to your users

For non-user-facing tasks, latency matters less and you can optimize for cost instead.
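The two measurements above can be illustrated with a small timing sketch. This is not any provider's API: `fake_stream` is a hypothetical stand-in for a streaming model response, delaying before the first token and then emitting tokens at a steady rate, so `measure` can record TTFT and total generation time separately.

```python
import time
from typing import Iterator, Tuple


def fake_stream(n_tokens: int = 20, ttft: float = 0.05,
                per_token: float = 0.005) -> Iterator[str]:
    """Stand-in for a streaming API response: waits `ttft` seconds
    before the first token, then yields tokens at a steady rate."""
    time.sleep(ttft)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token)
        yield f"tok{i} "


def measure(stream: Iterator[str]) -> Tuple[float, float]:
    """Return (time_to_first_token, total_generation_time) in seconds."""
    start = time.monotonic()  # monotonic clock: safe for measuring intervals
    first = None
    for _ in stream:
        if first is None:
            first = time.monotonic() - start  # moment the first token arrived
    total = time.monotonic() - start
    return first, total


ttft, total = measure(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms")
```

With a real streaming client, the same pattern applies: start the clock before the request, note the timestamp on the first received chunk, and note it again when the stream closes. The gap between the two numbers is what streaming hides from the user.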

Common Use Cases

  • Optimizing chat response times
  • Choosing between model tiers
  • Real-time feature design
  • User experience benchmarking
