Building Voice AI: OpenAI TTS + Whisper Integration

Voice is the most natural interface. Here's how to build voice-enabled AI applications that actually work.

The Voice AI Stack

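The stack is three stages chained together: speech-to-text, a language model, and text-to-speech. As a rough sketch, the chain can be written down as data. The `Stage` type and the `isChained` helper are illustrative names, and the middle stage is left generic because this article does not name a specific chat model.

```typescript
// One stage of the voice pipeline: what goes in, what comes out, who does the work.
type Stage = {
  name: string;
  input: "audio" | "text";
  output: "audio" | "text";
  service: string;
};

// The three stages this article covers, in order.
const voiceStack: Stage[] = [
  { name: "speech-to-text", input: "audio", output: "text", service: "Whisper" },
  { name: "reasoning", input: "text", output: "text", service: "LLM (any chat model)" },
  { name: "text-to-speech", input: "text", output: "audio", service: "OpenAI TTS" },
];

// A stack is well formed when each stage consumes what the previous one produced.
function isChained(stages: Stage[]): boolean {
  return stages.every((stage, i) => i === 0 || stages[i - 1].output === stage.input);
}
```

Audio goes in at the top and comes out at the bottom; everything in between is text, which is why the stages can be swapped or tested independently.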

Speech-to-Text with Whisper

Basic Transcription

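A minimal sketch of a transcription call, assuming Node 18+ (or a browser) where `fetch`, `FormData`, and `Blob` are global. It talks to the REST endpoint directly rather than through an SDK; the function names and the way the API key is passed in are illustrative choices, not the only way to do it.

```typescript
const TRANSCRIPTION_URL = "https://api.openai.com/v1/audio/transcriptions";

// Build the multipart body: the audio file plus the model name.
function buildTranscriptionForm(audio: Blob, filename: string): FormData {
  const form = new FormData();
  form.append("file", audio, filename); // the extension tells the API the format
  form.append("model", "whisper-1");
  return form;
}

// Send audio to the transcription endpoint and return the recognized text.
async function transcribe(audio: Blob, filename: string, apiKey: string): Promise<string> {
  const response = await fetch(TRANSCRIPTION_URL, {
    method: "POST",
    // No Content-Type header here: fetch sets the multipart boundary itself.
    headers: { Authorization: `Bearer ${apiKey}` },
    body: buildTranscriptionForm(audio, filename),
  });
  if (!response.ok) {
    throw new Error(`Transcription failed: ${response.status} ${await response.text()}`);
  }
  const data = (await response.json()) as { text: string };
  return data.text;
}
```

Keep the API key on the server. In a browser app, post the audio to your own backend and call `transcribe` from there.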

With Timestamps

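Timestamps come back when the request asks for the `verbose_json` response format. The sketch below shows the extra form fields and one way to turn the returned segments into caption lines. The types cover only the fields used here, and `formatTimestamp` and `toCaptionLines` are hypothetical helper names.

```typescript
// The fields used here from a `verbose_json` transcription response.
type Segment = { start: number; end: number; text: string };
type VerboseTranscription = { text: string; segments?: Segment[] };

// Extra form fields that ask for segment-level timestamps.
const timestampFields: Record<string, string> = {
  response_format: "verbose_json",
  "timestamp_granularities[]": "segment",
};

// Format seconds as mm:ss.mmm, e.g. 65.5 -> "01:05.500".
function formatTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const minutes = Math.floor(totalMs / 60000);
  const secs = Math.floor((totalMs % 60000) / 1000);
  const ms = totalMs % 1000;
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  return `${pad(minutes, 2)}:${pad(secs, 2)}.${pad(ms, 3)}`;
}

// Turn each segment into a "[start - end] text" caption line.
function toCaptionLines(result: VerboseTranscription): string[] {
  return (result.segments ?? []).map(
    (s) => `[${formatTimestamp(s.start)} - ${formatTimestamp(s.end)}] ${s.text.trim()}`,
  );
}
```

Append each entry of `timestampFields` to the multipart form before sending, then pass the parsed JSON response to `toCaptionLines`.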

Language Detection

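Whisper detects the spoken language when the request does not specify one, and the `verbose_json` response reports what it found. A small sketch of both halves follows. `languageFields` and `detectedLanguage` are made-up helper names, and the exact form of the returned value (language name versus ISO code) is worth checking against a real response rather than assuming.

```typescript
// Form fields for a request. Leaving `language` out lets Whisper detect it.
function languageFields(hint?: string): Record<string, string> {
  const fields: Record<string, string> = { response_format: "verbose_json" };
  if (hint) fields.language = hint; // an ISO-639-1 code such as "en" when already known
  return fields;
}

// Read the detected language from a parsed `verbose_json` response, if present.
function detectedLanguage(response: unknown): string | undefined {
  if (typeof response !== "object" || response === null) return undefined;
  const language = (response as { language?: unknown }).language;
  return typeof language === "string" && language.length > 0 ? language : undefined;
}
```

A reasonable pattern is to detect on the first utterance, store the result per user, and pass it as the hint on later requests.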

Streaming from Browser

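On the client, a sketch along these lines records from the microphone with `MediaRecorder` and posts the result to your own backend. The upload URL, the `audio` form field name, and the helper names are assumptions for illustration. `pickMimeType` takes the support check as a parameter so it can be exercised outside a browser.

```typescript
// Formats to try, in order of preference. WebM is what most browsers record.
const CANDIDATE_TYPES = ["audio/webm;codecs=opus", "audio/webm", "audio/mp4"];

// Pick the first type the browser can record.
function pickMimeType(isSupported: (type: string) => boolean): string | undefined {
  return CANDIDATE_TYPES.find(isSupported);
}

// Start recording; calling the returned function stops it and uploads the audio.
async function recordAndUpload(uploadUrl: string): Promise<() => void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mimeType = pickMimeType((type) => MediaRecorder.isTypeSupported(type));
  const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
  const chunks: Blob[] = [];

  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) chunks.push(event.data);
  };
  recorder.onstop = async () => {
    stream.getTracks().forEach((track) => track.stop()); // release the microphone
    const extension = recorder.mimeType.includes("mp4") ? "mp4" : "webm";
    const form = new FormData();
    form.append("audio", new Blob(chunks, { type: recorder.mimeType }), `recording.${extension}`);
    await fetch(uploadUrl, { method: "POST", body: form });
  };

  recorder.start();
  return () => recorder.stop();
}
```

Usage is a push-to-talk button: call `recordAndUpload("/api/transcribe")` on press, and call the function it resolves to on release.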

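On the server, a handler has to accept the upload, give the file a name with a recognizable extension, and forward it to Whisper. The sketch below uses the standard `Request` and `Response` types, so it should fit any runtime that exposes the Fetch API. It assumes the client posts a multipart form with an `audio` field; the handler name and error shapes are illustrative.

```typescript
// Map a recorded MIME type to a file extension the transcription API recognizes.
function extensionForMime(mimeType: string): string | undefined {
  const base = mimeType.split(";")[0].trim().toLowerCase(); // drop ";codecs=opus"
  const known: Record<string, string> = {
    "audio/webm": "webm",
    "audio/mp4": "mp4",
    "audio/mpeg": "mp3",
    "audio/wav": "wav",
  };
  return known[base];
}

function jsonResponse(body: unknown, status: number): Response {
  return new Response(JSON.stringify(body), {
    status,
    headers: { "Content-Type": "application/json" },
  });
}

// Receive the browser upload, forward it to Whisper, and return the text as JSON.
async function handleUpload(request: Request, apiKey: string): Promise<Response> {
  const upload = (await request.formData()).get("audio");
  if (upload === null || typeof upload === "string") {
    return jsonResponse({ error: "Missing audio file" }, 400);
  }
  const extension = extensionForMime(upload.type);
  if (extension === undefined) {
    return jsonResponse({ error: `Unsupported audio type: ${upload.type}` }, 415);
  }

  const form = new FormData();
  form.append("file", upload, `recording.${extension}`);
  form.append("model", "whisper-1");
  const result = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  if (!result.ok) return jsonResponse({ error: "Transcription failed" }, 502);

  const data = (await result.json()) as { text: string };
  return jsonResponse({ text: data.text }, 200);
}
```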

Text-to-Speech with TTS

Basic Speech Generation

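A minimal sketch of speech generation against the REST endpoint, assuming Node 18+ with global `fetch`. The helper names are illustrative, and the 4096-character input cap is the per-request limit as I understand it from the API reference; treat it as an assumption and confirm it before relying on it.

```typescript
const SPEECH_URL = "https://api.openai.com/v1/audio/speech";
const MAX_INPUT_CHARS = 4096; // assumed per-request input limit

type SpeechRequest = { model: "tts-1" | "tts-1-hd"; voice: string; input: string };

// Validate the text and build the JSON body for the speech endpoint.
function buildSpeechRequest(text: string, voice = "alloy"): SpeechRequest {
  const input = text.trim();
  if (input.length === 0) throw new Error("Nothing to speak");
  if (input.length > MAX_INPUT_CHARS) {
    throw new Error(`Input exceeds ${MAX_INPUT_CHARS} characters`);
  }
  return { model: "tts-1", voice, input };
}

// Generate speech and return the audio bytes (MP3 unless another format is requested).
async function speak(text: string, apiKey: string): Promise<ArrayBuffer> {
  const response = await fetch(SPEECH_URL, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify(buildSpeechRequest(text)),
  });
  if (!response.ok) throw new Error(`TTS failed: ${response.status}`);
  return response.arrayBuffer();
}
```

Longer replies need to be split into sentences or paragraphs and generated piece by piece.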

Voice Options

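The speech endpoint takes a voice, an output format, and a speed. The sketch below encodes the options as I know them for `tts-1` (six voices, six formats, speed from 0.25 to 4.0) and fills in defaults. Newer voices or formats may exist, so treat the lists as a starting point. `resolveVoiceOptions` is a hypothetical helper.

```typescript
const VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"] as const;
type Voice = (typeof VOICES)[number];

const FORMATS = ["mp3", "opus", "aac", "flac", "wav", "pcm"] as const;
type AudioFormat = (typeof FORMATS)[number];

type VoiceOptions = { voice: Voice; response_format: AudioFormat; speed: number };

// Type guard: is this string one of the known voices?
function isVoice(value: string): value is Voice {
  return (VOICES as readonly string[]).includes(value);
}

// Fill in defaults, fall back to "alloy" for unknown voices, and clamp the speed.
function resolveVoiceOptions(requested: {
  voice?: string;
  format?: AudioFormat;
  speed?: number;
}): VoiceOptions {
  const voice = requested.voice && isVoice(requested.voice) ? requested.voice : "alloy";
  const speed = Math.min(4.0, Math.max(0.25, requested.speed ?? 1.0));
  return { voice, response_format: requested.format ?? "mp3", speed };
}
```

Spread the result into the request body next to `model` and `input`. Listen to each voice with your own content before picking one; what sounds right depends heavily on the product.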

Streaming TTS

For real-time playback, stream the audio:

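One way to do this is to read the HTTP response body as a stream and hand each chunk to the player as it arrives, instead of waiting for the whole file. This is a sketch, assuming a runtime with `fetch` and web streams (Node 18+ or a browser). `streamSpeech` and the injectable `fetchImpl` parameter are illustrative; the injection is there only so the function can be exercised without a network call.

```typescript
type FetchLike = (url: string, init: RequestInit) => Promise<Response>;

// Request speech and pass each audio chunk to `onChunk` as soon as it arrives.
// Resolves with the total number of bytes received.
async function streamSpeech(
  text: string,
  apiKey: string,
  onChunk: (chunk: Uint8Array) => void,
  fetchImpl: FetchLike = fetch,
): Promise<number> {
  const response = await fetchImpl("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify({ model: "tts-1", voice: "alloy", input: text }),
  });
  if (!response.ok || response.body === null) {
    throw new Error(`TTS stream failed: ${response.status}`);
  }

  const reader = response.body.getReader();
  let total = 0;
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    total += value.byteLength;
    onChunk(value); // e.g. forward to the client or append to a playback buffer
  }
  return total;
}
```

On a server, `onChunk` typically writes straight to the client response, so playback can begin while the rest of the audio is still being generated.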

Complete Voice Agent

Putting it all together:

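A sketch of the full loop: transcribe the user's audio, send the running conversation to a chat model, and speak the reply. The three services are passed in as plain functions, so the class itself makes no API calls. `VoiceAgent`, `VoiceDeps`, and the message shape are illustrative, not a fixed interface.

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// The three services the agent depends on; real versions would call the APIs.
type VoiceDeps = {
  transcribe: (audio: Uint8Array) => Promise<string>;
  chat: (history: Message[]) => Promise<string>;
  synthesize: (text: string) => Promise<Uint8Array>;
};

type TurnResult = { heard: string; reply: string; speech: Uint8Array };

class VoiceAgent {
  private deps: VoiceDeps;
  private history: Message[];

  constructor(deps: VoiceDeps, systemPrompt: string) {
    this.deps = deps;
    this.history = [{ role: "system", content: systemPrompt }];
  }

  // One conversational turn: audio in, audio out, with history kept for context.
  async handleTurn(audio: Uint8Array): Promise<TurnResult> {
    const heard = (await this.deps.transcribe(audio)).trim();
    if (heard.length === 0) throw new Error("No speech recognized");

    const reply = await this.deps.chat([...this.history, { role: "user", content: heard }]);
    // Record the exchange only once the model has answered.
    this.history.push({ role: "user", content: heard }, { role: "assistant", content: reply });

    const speech = await this.deps.synthesize(reply);
    return { heard, reply, speech };
  }

  getHistory(): readonly Message[] {
    return this.history;
  }
}
```

Injecting the services keeps the turn logic testable with fakes and makes it easy to swap `tts-1` for `tts-1-hd` or change the chat model without touching the agent.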

Real-Time Conversations

For true real-time, use OpenAI's Realtime API:

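The Realtime API works over a persistent connection that exchanges JSON events. The sketch below covers only the message layer: configuring a session with server-side voice activity detection, sending microphone audio, and routing audio coming back. The URL, model name, and event names reflect the beta protocol as I know it and may have changed, so check them against the current reference. Opening and authenticating the socket is left out.

```typescript
// Assumed endpoint and model name; verify against the current Realtime API docs.
const REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview";

// Configure the session so the server detects when the user starts and stops talking.
function sessionUpdate(instructions: string): string {
  return JSON.stringify({
    type: "session.update",
    session: { instructions, voice: "alloy", turn_detection: { type: "server_vad" } },
  });
}

// Wrap one chunk of base64-encoded microphone audio for sending.
function appendAudio(base64Audio: string): string {
  return JSON.stringify({ type: "input_audio_buffer.append", audio: base64Audio });
}

// Route one server message: audio deltas go to playback, other events are ignored here.
// Returns the event type so callers can log or branch on it.
function handleServerEvent(raw: string, playAudio: (base64Audio: string) => void): string {
  const event = JSON.parse(raw) as { type: string; delta?: string };
  if (event.type === "response.audio.delta" && event.delta) playAudio(event.delta);
  return event.type;
}
```

With a socket open to `REALTIME_URL`, send `sessionUpdate(...)` once, send `appendAudio(...)` for each microphone chunk, and pass every incoming message to `handleServerEvent`.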

Mobile Integration (iOS)

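We built SheGPT with voice in 1 day. The Swift pattern: record with AVAudioEngine, send the captured audio for transcription, and stream the response back so users get immediate audio feedback.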

Cost Optimization

Whisper Costs

$0.006 per minute of audio
10 hours/day = 600 minutes = ~$3.60/day
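
The per-minute rate turns into a daily figure by multiplying by the minutes of audio processed. A tiny sketch, with `whisperDailyCost` as an illustrative name:

```typescript
const WHISPER_USD_PER_MINUTE = 0.006;

// Daily transcription cost in USD for a given number of audio hours per day.
function whisperDailyCost(audioHoursPerDay: number): number {
  return audioHoursPerDay * 60 * WHISPER_USD_PER_MINUTE;
}

// Rough monthly figure, assuming 30 days of steady usage.
function whisperMonthlyCost(audioHoursPerDay: number): number {
  return whisperDailyCost(audioHoursPerDay) * 30;
}
```

Plug in your own expected hours of audio to see where transcription sits in the budget.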

TTS Costs

Model	Per 1M characters
tts-1	$15
tts-1-hd	$30
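
TTS is billed on input characters, so the cost of a reply can be estimated before it is generated. A short sketch using the prices in the table; `ttsCost` is a hypothetical helper.

```typescript
// USD per 1M input characters, from the table above.
const TTS_USD_PER_MILLION_CHARS = { "tts-1": 15, "tts-1-hd": 30 } as const;
type TtsModel = keyof typeof TTS_USD_PER_MILLION_CHARS;

// Estimated cost in USD of speaking `characters` characters with the given model.
function ttsCost(characters: number, model: TtsModel = "tts-1"): number {
  return (characters / 1e6) * TTS_USD_PER_MILLION_CHARS[model];
}
```

Calling `ttsCost(reply.length)` before synthesis makes it easy to log spend per user or to cap unusually long replies.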

Optimization Strategies

Client-side VAD: Only send audio when speech is detected
Compress audio: Whisper accepts compressed formats such as mp3 and webm, so there is no need to upload raw wav
Cache common responses: TTS the same phrases? Cache them
Use tts-1 for most cases: HD only for premium features
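
For the client-side VAD item, the simplest possible version is an energy threshold: measure how loud each frame of samples is and drop the quiet ones. This is a sketch of that heuristic only, not a production detector (trained VAD models handle noise far better). The function names and the default threshold of 0.02 are arbitrary choices to tune for your microphone.

```typescript
// Root-mean-square energy of one frame of samples in the range [-1, 1].
function rmsEnergy(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) sumSquares += frame[i] * frame[i];
  return Math.sqrt(sumSquares / frame.length);
}

// Treat a frame as speech when its energy reaches the threshold.
function isSpeech(frame: Float32Array, threshold = 0.02): boolean {
  return rmsEnergy(frame) >= threshold;
}

// Keep only the frames that look like speech, ready to encode and upload.
function speechFrames(frames: Float32Array[], threshold = 0.02): Float32Array[] {
  return frames.filter((frame) => isSpeech(frame, threshold));
}
```

Dropping silent frames before upload cuts billed minutes directly, since Whisper pricing is per minute of audio sent.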

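Caching is the easiest of these to add. The sketch below wraps any synthesize function with an in-memory cache keyed by voice and normalized text. It stores the pending promise, so two simultaneous requests for the same phrase trigger one API call. Names are illustrative, eviction is simple first-in-first-out, and a real deployment would likely use Redis or object storage instead of a `Map`.

```typescript
type Synthesize = (text: string, voice: string) => Promise<Uint8Array>;

// Normalize whitespace so "Welcome back" and "  Welcome   back " share one entry.
function cacheKey(text: string, voice: string): string {
  return `${voice}:${text.trim().replace(/\s+/g, " ")}`;
}

// Wrap a synthesize function with an in-memory cache.
function withTtsCache(synthesize: Synthesize, maxEntries = 500): Synthesize {
  const cache = new Map<string, Promise<Uint8Array>>();
  return (text, voice) => {
    const key = cacheKey(text, voice);
    const hit = cache.get(key);
    if (hit) return hit;

    const pending = synthesize(text, voice);
    cache.set(key, pending);
    pending.catch(() => cache.delete(key)); // do not keep failed requests

    if (cache.size > maxEntries) {
      const oldest = cache.keys().next().value; // Map preserves insertion order
      if (oldest !== undefined) cache.delete(oldest);
    }
    return pending;
  };
}
```

Greetings, error messages, and menu prompts are the usual wins here, since they repeat across users.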

Production Checklist

Audio format validation (Whisper supports: mp3, wav, webm, etc.)
File size limits (max 25MB for Whisper)
Timeout handling for long audio
Graceful degradation when APIs fail
Rate limiting per user
Cost monitoring and alerts
Logging for debugging
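
Two of these items, upload validation and per-user rate limiting, can be sketched in a few lines. The names, the fixed-window strategy, and reading 25MB as 25 × 1024 × 1024 bytes are all assumptions for illustration. The limiter keeps state in memory, so it only works as written on a single server process.

```typescript
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024; // 25MB Whisper limit, read as MiB
const SUPPORTED_EXTENSIONS = ["mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"];

type Validation = { ok: true } | { ok: false; reason: string };

// Reject uploads Whisper would refuse before spending a request on them.
function validateAudio(filename: string, sizeBytes: number): Validation {
  const parts = filename.toLowerCase().split(".");
  const extension = parts.length > 1 ? parts[parts.length - 1] : "";
  if (!SUPPORTED_EXTENSIONS.includes(extension)) {
    return { ok: false, reason: `Unsupported format: ${extension || "none"}` };
  }
  if (sizeBytes <= 0) return { ok: false, reason: "Empty file" };
  if (sizeBytes > MAX_UPLOAD_BYTES) return { ok: false, reason: "File exceeds 25MB" };
  return { ok: true };
}

// Fixed-window rate limiter: at most `limit` requests per user per window.
class RateLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `now` is injectable so the limiter can be tested without waiting.
  allow(userId: string, now: number = Date.now()): boolean {
    const current = this.windows.get(userId);
    if (!current || now - current.start >= this.windowMs) {
      this.windows.set(userId, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count++;
    return true;
  }
}
```

Run both checks at the top of the upload handler, before any audio is forwarded, so rejected requests cost nothing.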

Frequently Asked Questions

Q: How much does it cost to build a voice AI application with OpenAI?

Whisper speech-to-text costs $0.006 per minute of audio, and TTS (text-to-speech) costs $15 per million characters for standard quality or $30 for HD. For a voice app processing 10 hours of audio daily (600 minutes), Whisper costs about $3.60/day. The total depends on usage volume, but a consumer app with moderate traffic can run voice features for $200-500/month in API costs.

Q: What is the difference between OpenAI Whisper and the Realtime API for voice?

The Whisper approach is a turn-based pipeline: audio is recorded, sent to Whisper for transcription, processed by an LLM, and then converted back to speech via TTS. The Realtime API enables true real-time conversation with simultaneous audio input and output, server-side voice activity detection, and sub-second latency. The Whisper pipeline is cheaper and simpler; the Realtime API delivers a more natural conversational experience.

Q: What audio formats does OpenAI Whisper support?

Whisper supports mp3, mp4, mpeg, mpga, m4a, wav, and webm formats with a maximum file size of 25MB. For browser-based applications, WebM is the most common format from MediaRecorder. For mobile apps, MP4 or M4A from the device microphone works well. Whisper also automatically detects the spoken language if you do not specify one.

Q: How do you optimize voice AI for mobile apps?

Key optimizations include client-side voice activity detection (VAD) so you only send audio when speech is detected, compressing audio before transmission, caching common TTS responses to avoid regenerating the same phrases, and using the standard tts-1 model instead of tts-1-hd for most interactions. On iOS, the AVAudioEngine framework handles recording efficiently, and streaming responses back gives users immediate audio feedback.

Building a Voice App?

We specialize in voice AI applications.

AI 4U Labs builds production voice AI. SheGPT shipped in 1 day with full voice support.
