Tutorial
9 min read

Building Voice AI: OpenAI TTS + Whisper Integration

A complete guide to building voice-enabled AI applications using OpenAI's TTS and Whisper APIs. From real-time conversations to async processing.

Voice is the most natural interface. Here's how to build voice-enabled AI applications that actually work.

The Voice AI Stack

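The stack has three stages: Whisper turns recorded speech into text, an LLM generates a reply, and TTS converts that reply back into audio.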

Speech-to-Text with Whisper

Basic Transcription

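A minimal sketch of a transcription call, assuming Node 18+ (global `fetch`, `FormData`, `Blob`) and OpenAI's `POST /v1/audio/transcriptions` endpoint with the `whisper-1` model. The helper names `buildTranscriptionForm` and `transcribe` are illustrative and not part of any SDK.

```typescript
// Hypothetical helpers. Only the endpoint, the `file` and `model` form fields,
// and the `text` response field come from OpenAI's API.
const TRANSCRIPTION_URL = "https://api.openai.com/v1/audio/transcriptions";

/** Build the multipart form the endpoint expects. */
function buildTranscriptionForm(audio: Blob, filename: string): FormData {
  const form = new FormData();
  form.append("file", audio, filename); // the extension tells the API the format
  form.append("model", "whisper-1");
  return form;
}

/** Send audio for transcription and return the recognized text. */
async function transcribe(audio: Blob, filename: string, apiKey: string): Promise<string> {
  const res = await fetch(TRANSCRIPTION_URL, {
    method: "POST",
    // No Content-Type header: fetch sets the multipart boundary itself.
    headers: { Authorization: `Bearer ${apiKey}` },
    body: buildTranscriptionForm(audio, filename),
  });
  if (!res.ok) {
    throw new Error(`Transcription failed: ${res.status} ${await res.text()}`);
  }
  const data = (await res.json()) as { text: string };
  return data.text;
}
```

Passing the API key as an argument keeps the sketch independent of how you load secrets; in a real server you would likely read it from configuration.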

With Timestamps

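One way to get timestamps is to request `response_format=verbose_json` with `timestamp_granularities[]=segment`, which returns a `segments` array with `start` and `end` in seconds. The sketch below assumes that response shape; `requestTimestamps`, `formatTime`, and `toCaptionLines` are hypothetical helper names.

```typescript
// The fields this sketch reads from each segment of a `verbose_json` response.
interface TranscriptSegment {
  start: number; // seconds from the start of the audio
  end: number;
  text: string;
}

/** Add the form fields that request segment-level timestamps. */
function requestTimestamps(form: FormData): FormData {
  form.set("response_format", "verbose_json");
  form.set("timestamp_granularities[]", "segment");
  return form;
}

function pad2(n: number): string {
  return n < 10 ? `0${n}` : `${n}`;
}

/** Format seconds as mm:ss.cc, using integer centiseconds to avoid rounding drift. */
function formatTime(seconds: number): string {
  const total = Math.round(seconds * 100);
  const minutes = Math.floor(total / 6000);
  const rest = total % 6000;
  return `${pad2(minutes)}:${pad2(Math.floor(rest / 100))}.${pad2(rest % 100)}`;
}

/** Turn segments into caption-style lines. */
function toCaptionLines(segments: TranscriptSegment[]): string[] {
  return segments.map(
    (seg) => `[${formatTime(seg.start)} - ${formatTime(seg.end)}] ${seg.text.trim()}`
  );
}
```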

Language Detection

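Whisper detects the spoken language on its own when you do not specify one. A rough sketch, assuming the `verbose_json` response carries a `language` field and that the optional `language` request field takes an ISO-639-1 code; both helper names are made up for illustration.

```typescript
// Subset of a `verbose_json` transcription response used here (assumed shape).
interface VerboseTranscription {
  text: string;
  language?: string;
}

/** Optionally pin the language with an ISO-639-1 code such as "en" or "es". */
function applyLanguageHint(form: FormData, language?: string): FormData {
  if (language) {
    form.set("language", language.toLowerCase());
  }
  return form; // without the field, detection is left to Whisper
}

/** Read the detected language, falling back to a caller-supplied default. */
function detectedLanguage(result: VerboseTranscription, fallback = "unknown"): string {
  const lang = result.language?.trim();
  return lang ? lang.toLowerCase() : fallback;
}
```

A hint is worth sending when you already know the user's language, since it removes one source of error on short or noisy clips.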

Streaming from Browser

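In the browser, `MediaRecorder` produces the audio chunks and the server forwards them to Whisper. Here is a sketch of the format-selection step, written so it also runs outside a browser: the `isSupported` callback stands in for `MediaRecorder.isTypeSupported`, and the candidate list and the `upload` function in the usage comment are assumptions.

```typescript
// Candidate containers in preference order (an assumed list, adjust as needed).
const CANDIDATE_MIME_TYPES = ["audio/webm;codecs=opus", "audio/webm", "audio/mp4"];

/** Pick the first MIME type the recorder supports, or undefined if none match. */
function pickRecordingMimeType(isSupported: (mime: string) => boolean): string | undefined {
  for (const mime of CANDIDATE_MIME_TYPES) {
    if (isSupported(mime)) return mime;
  }
  return undefined;
}

/** Map a MIME type to a filename with an extension Whisper accepts. */
function filenameFor(mime: string): string {
  return mime.indexOf("audio/mp4") === 0 ? "recording.mp4" : "recording.webm";
}

// Browser usage (not executed here; `upload` is a hypothetical function that
// posts the blob to your own server):
//   const mime = pickRecordingMimeType((m) => MediaRecorder.isTypeSupported(m));
//   const recorder = new MediaRecorder(stream, mime ? { mimeType: mime } : undefined);
//   const chunks: Blob[] = [];
//   recorder.ondataavailable = (e) => chunks.push(e.data);
//   recorder.onstop = () =>
//     upload(new Blob(chunks, { type: mime }), filenameFor(mime ?? "audio/webm"));
//   recorder.start(1000); // emit a chunk roughly every second
```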

Text-to-Speech with TTS

Basic Speech Generation

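A minimal sketch of speech generation against OpenAI's `POST /v1/audio/speech` endpoint, assuming Node 18+ for global `fetch`. The `buildSpeechBody` and `synthesize` helpers and the `SpeechOptions` type are illustrative names, and the defaults are assumptions you can change.

```typescript
const SPEECH_URL = "https://api.openai.com/v1/audio/speech";

interface SpeechOptions {
  model?: "tts-1" | "tts-1-hd";
  voice?: string;
  format?: "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm";
}

/** Build the JSON body for a speech request. */
function buildSpeechBody(text: string, options: SpeechOptions = {}): string {
  return JSON.stringify({
    model: options.model ?? "tts-1",
    voice: options.voice ?? "alloy",
    input: text,
    response_format: options.format ?? "mp3",
  });
}

/** Generate speech and return the encoded audio bytes. */
async function synthesize(
  text: string,
  apiKey: string,
  options?: SpeechOptions
): Promise<ArrayBuffer> {
  const res = await fetch(SPEECH_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: buildSpeechBody(text, options),
  });
  if (!res.ok) {
    throw new Error(`Speech generation failed: ${res.status} ${await res.text()}`);
  }
  return res.arrayBuffer();
}
```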

Voice Options

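The `tts-1` models shipped with six built-in voices: alloy, echo, fable, onyx, nova, and shimmer. A small sketch of validating a user-selected voice before sending it to the API; the list may have grown since, so treat `VOICES` as an assumption to keep in sync with the current documentation.

```typescript
// The original six voices; newer ones may exist and can be added here.
const VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"] as const;
type Voice = (typeof VOICES)[number];

/** Type guard: is this string one of the known voices? */
function isVoice(value: string): value is Voice {
  return (VOICES as readonly string[]).indexOf(value) !== -1;
}

/** Use the requested voice if it is known, otherwise fall back. */
function resolveVoice(requested: string | undefined, fallback: Voice = "alloy"): Voice {
  return requested !== undefined && isVoice(requested) ? requested : fallback;
}
```

Falling back instead of throwing keeps a stale client setting from breaking playback.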

Streaming TTS

For real-time playback, stream the audio:

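A sketch of consuming the speech response incrementally instead of waiting for the full file. It assumes a `fetch` response whose `body` is a `ReadableStream<Uint8Array>`; `streamAudio`, `concatChunks`, and the `player` object in the usage comment are hypothetical.

```typescript
/** Read a response body chunk by chunk, handing each chunk to `onChunk`. */
async function streamAudio(
  body: ReadableStream<Uint8Array>,
  onChunk: (chunk: Uint8Array) => void
): Promise<number> {
  const reader = body.getReader();
  let totalBytes = 0;
  for (;;) {
    const result = await reader.read();
    if (result.done) break;
    onChunk(result.value); // e.g. feed a player or forward to a client
    totalBytes += result.value.byteLength;
  }
  return totalBytes;
}

/** Join buffered chunks, for example to cache or save the finished clip. */
function concatChunks(chunks: Uint8Array[]): Uint8Array {
  const size = chunks.reduce((sum, c) => sum + c.byteLength, 0);
  const out = new Uint8Array(size);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}

// Usage (assumes `res` is the fetch response from the speech endpoint and
// `player` is your own audio sink):
//   if (res.body) await streamAudio(res.body, (chunk) => player.enqueue(chunk));
```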

Complete Voice Agent

Putting it all together:

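One possible shape for the agent: each stage is injected as a function, so the Whisper, LLM, and TTS calls can be swapped or mocked. Everything here (`VoiceStages`, `VoiceAgent`, `trimHistory`, the 20-message default) is an illustrative assumption rather than a fixed design.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface VoiceStages {
  transcribe(audio: Uint8Array): Promise<string>; // speech-to-text, e.g. Whisper
  respond(history: ChatMessage[]): Promise<string>; // LLM call
  synthesize(text: string): Promise<Uint8Array>; // text-to-speech
}

/** Keep system messages plus the most recent `maxMessages` other messages. */
function trimHistory(history: ChatMessage[], maxMessages: number): ChatMessage[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  return system.concat(rest.slice(Math.max(0, rest.length - maxMessages)));
}

class VoiceAgent {
  private history: ChatMessage[];
  private stages: VoiceStages;
  private maxMessages: number;

  constructor(stages: VoiceStages, systemPrompt: string, maxMessages = 20) {
    this.stages = stages;
    this.maxMessages = maxMessages;
    this.history = [{ role: "system", content: systemPrompt }];
  }

  /** One conversational turn: audio in, audio out. */
  async handleTurn(
    audio: Uint8Array
  ): Promise<{ transcript: string; reply: string; audio: Uint8Array }> {
    const transcript = await this.stages.transcribe(audio);
    this.history.push({ role: "user", content: transcript });
    const reply = await this.stages.respond(trimHistory(this.history, this.maxMessages));
    this.history.push({ role: "assistant", content: reply });
    return { transcript, reply, audio: await this.stages.synthesize(reply) };
  }
}
```

Trimming the history bounds the prompt size, which matters in long voice sessions where every turn adds two messages.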

Real-Time Conversations

For true real-time, use OpenAI's Realtime API:

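The Realtime API runs over a WebSocket and exchanges JSON events. The sketch below only builds the client events, assuming the event names `session.update`, `input_audio_buffer.append`, and `response.create` and the `server_vad` turn-detection type; the exact session fields have changed between API versions, so check them against the current reference. The model name is left as a parameter for the same reason.

```typescript
type RealtimeClientEvent =
  | {
      type: "session.update";
      session: {
        instructions?: string;
        voice?: string;
        turn_detection?: { type: "server_vad" };
      };
    }
  | { type: "input_audio_buffer.append"; audio: string }
  | { type: "response.create" };

/** WebSocket URL for a given realtime model. */
function realtimeUrl(model: string): string {
  return `wss://api.openai.com/v1/realtime?model=${encodeURIComponent(model)}`;
}

/** Configure the session to use server-side voice activity detection. */
function sessionUpdate(instructions: string, voice: string): RealtimeClientEvent {
  return {
    type: "session.update",
    session: { instructions, voice, turn_detection: { type: "server_vad" } },
  };
}

/** Base64-encode raw bytes without relying on Node-only APIs. */
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

/** Wrap a chunk of captured audio for sending to the server. */
function appendAudio(pcm16: Uint8Array): RealtimeClientEvent {
  return { type: "input_audio_buffer.append", audio: toBase64(pcm16) };
}

// Usage with any WebSocket client (authentication headers omitted here):
//   socket.send(JSON.stringify(sessionUpdate("Be brief.", "alloy")));
//   socket.send(JSON.stringify(appendAudio(chunk)));
```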

Mobile Integration (iOS)

We built SheGPT with voice in 1 day. The Swift pattern records with AVAudioEngine and streams responses back so users get immediate audio feedback.

Cost Optimization

Whisper Costs

  • $0.006 per minute of audio
  • 10 hours/day = 600 minutes = ~$3.60/day

TTS Costs

Model       Per 1M characters
tts-1       $15
tts-1-hd    $30
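
The rates above can be turned into a quick estimator. This is a sketch that hard-codes the prices quoted in this section, which may change; the function names are illustrative.

```typescript
// Prices as quoted in this article; verify against current pricing before use.
const WHISPER_USD_PER_MINUTE = 0.006;
const TTS_USD_PER_MILLION_CHARS = { "tts-1": 15, "tts-1-hd": 30 } as const;
type TtsModel = keyof typeof TTS_USD_PER_MILLION_CHARS;

/** Cost of transcribing the given number of audio minutes. */
function whisperCostUsd(audioMinutes: number): number {
  return audioMinutes * WHISPER_USD_PER_MINUTE;
}

/** Cost of synthesizing the given number of characters. */
function ttsCostUsd(characters: number, model: TtsModel = "tts-1"): number {
  return (characters / 1e6) * TTS_USD_PER_MILLION_CHARS[model];
}

/** Combined daily estimate from hours of audio in and characters of speech out. */
function dailyCostUsd(audioHours: number, ttsCharacters: number, model: TtsModel = "tts-1"): number {
  return whisperCostUsd(audioHours * 60) + ttsCostUsd(ttsCharacters, model);
}

// Example: 10 hours of audio is 600 minutes at $0.006 per minute.
const tenHoursOfWhisper = whisperCostUsd(10 * 60);
```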

Optimization Strategies

  1. Client-side VAD: Only send audio when speech is detected
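  2. Compress audio: Whisper accepts compressed formats such as mp3, m4a, and webm, so smaller uploads stay under the 25MB limit
  3. Cache common responses: if you synthesize the same phrases repeatedly, generate them once and reuse the audio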
  4. Use tts-1 for most cases: HD only for premium features
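
Strategy 1 can be approximated with a simple energy gate. This is a deliberately naive sketch: production VAD usually relies on a trained model, and the threshold and hangover values here are guesses to tune against your own microphone input.

```typescript
/** Root-mean-square energy of one frame of samples in the range [-1, 1]. */
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) {
    sum += frame[i] * frame[i];
  }
  return Math.sqrt(sum / frame.length);
}

/** Opens on a loud frame and closes after more than `hangoverFrames` quiet ones. */
class EnergyVad {
  private speaking = false;
  private quietFrames = 0;
  private threshold: number;
  private hangoverFrames: number;

  constructor(threshold = 0.02, hangoverFrames = 10) {
    this.threshold = threshold;
    this.hangoverFrames = hangoverFrames;
  }

  /** Returns true while frames should be sent upstream. */
  process(frame: Float32Array): boolean {
    if (rms(frame) >= this.threshold) {
      this.speaking = true;
      this.quietFrames = 0;
    } else if (this.speaking) {
      this.quietFrames += 1;
      if (this.quietFrames > this.hangoverFrames) {
        this.speaking = false;
      }
    }
    return this.speaking;
  }
}
```

The hangover keeps short pauses inside a sentence from cutting the recording into fragments.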
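
For the caching strategy, a small in-memory sketch keyed by model, voice, and normalized text. `SpeechCache` and `speechCacheKey` are hypothetical names, and a real deployment would more likely use a shared store than a per-process `Map`.

```typescript
/** Collapse whitespace so trivially different inputs share one cache entry. */
function speechCacheKey(model: string, voice: string, text: string): string {
  return `${model}|${voice}|${text.trim().replace(/\s+/g, " ")}`;
}

class SpeechCache {
  private entries = new Map<string, Promise<Uint8Array>>();
  private maxEntries: number;

  constructor(maxEntries = 500) {
    this.maxEntries = maxEntries;
  }

  get size(): number {
    return this.entries.size;
  }

  /** Return cached audio, or call `generate` once and remember the result. */
  getOrCreate(key: string, generate: () => Promise<Uint8Array>): Promise<Uint8Array> {
    const hit = this.entries.get(key);
    if (hit) return hit;
    if (this.entries.size >= this.maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest entry.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    // Caching the promise means concurrent requests share one API call.
    const pending = generate().catch((err) => {
      this.entries.delete(key); // do not cache failures
      throw err;
    });
    this.entries.set(key, pending);
    return pending;
  }
}
```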

Production Checklist

  • Audio format validation (Whisper supports mp3, mp4, mpeg, mpga, m4a, wav, and webm)
  • File size limits (max 25MB for Whisper)
  • Timeout handling for long audio
  • Graceful degradation when APIs fail
  • Rate limiting per user
  • Cost monitoring and alerts
  • Logging for debugging
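
The first two checklist items can be enforced before any audio leaves your server. A sketch, assuming the format list and 25MB limit stated in this article; `validateUpload` is an illustrative name, and treating 25MB as 25 × 1024 × 1024 bytes is an assumption.

```typescript
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024; // assumed interpretation of "25MB"
const SUPPORTED_EXTENSIONS = ["mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"];

type ValidationResult = { ok: true } | { ok: false; reason: string };

/** Reject uploads Whisper would refuse, before spending a request on them. */
function validateUpload(filename: string, sizeBytes: number): ValidationResult {
  const dot = filename.lastIndexOf(".");
  const ext = dot === -1 ? "" : filename.slice(dot + 1).toLowerCase();
  if (SUPPORTED_EXTENSIONS.indexOf(ext) === -1) {
    return { ok: false, reason: `Unsupported format: ${ext || "none"}` };
  }
  if (sizeBytes <= 0) {
    return { ok: false, reason: "Empty file" };
  }
  if (sizeBytes > MAX_UPLOAD_BYTES) {
    return { ok: false, reason: `File too large: ${sizeBytes} bytes` };
  }
  return { ok: true };
}
```

Checking the extension is only a first filter; it does not prove the bytes match the claimed format.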

Frequently Asked Questions

Q: How much does it cost to build a voice AI application with OpenAI?

Whisper speech-to-text costs $0.006 per minute of audio, and TTS (text-to-speech) costs $15 per million characters for standard quality or $30 for HD. For a voice app processing 10 hours (600 minutes) of audio daily, Whisper costs about $3.60/day. The total cost depends on usage volume, but a consumer app with moderate traffic can run voice features for $200-500/month in API costs.

Q: What is the difference between OpenAI Whisper and the Realtime API for voice?

Whisper is the speech-to-text step in a turn-based pipeline: audio is recorded, sent for transcription, processed by an LLM, and then converted back to speech via TTS. The Realtime API enables true real-time conversation with simultaneous audio input and output, server-side voice activity detection, and sub-second latency. The Whisper pipeline is cheaper and simpler; the Realtime API delivers a more natural conversational experience.

Q: What audio formats does OpenAI Whisper support?

Whisper supports mp3, mp4, mpeg, mpga, m4a, wav, and webm formats with a maximum file size of 25MB. For browser-based applications, WebM is the most common format from MediaRecorder. For mobile apps, MP4 or M4A from the device microphone works well. Whisper also automatically detects the spoken language if you do not specify one.

Q: How do you optimize voice AI for mobile apps?

Key optimizations include client-side voice activity detection (VAD) so you only send audio when speech is detected, compressing audio before transmission, caching common TTS responses to avoid regenerating the same phrases, and using the standard tts-1 model instead of tts-1-hd for most interactions. On iOS, the AVAudioEngine framework handles recording efficiently, and streaming responses back gives users immediate audio feedback.


AI 4U Labs builds production voice AI. SheGPT shipped in 1 day with full voice support.
