Building Voice AI: OpenAI TTS + Whisper Integration
Voice is the most natural interface. Here's how to build voice-enabled AI applications that actually work.
The Voice AI Stack
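The stack is a three-stage pipeline: Whisper transcribes recorded audio to text, an LLM generates a reply, and TTS converts that reply back to speech.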
Speech-to-Text with Whisper
Basic Transcription
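A minimal sketch of a transcription call, assuming Node 18+ or a browser with global `fetch`, `FormData`, and `Blob`. It posts directly to the REST endpoint with the `whisper-1` model; the helper names (`buildTranscriptionForm`, `transcribeAudio`) are illustrative, and the API key is passed in rather than read from any particular config.

```typescript
// REST endpoint for OpenAI's hosted Whisper transcription.
const TRANSCRIBE_URL = "https://api.openai.com/v1/audio/transcriptions";

// Build the multipart form the endpoint expects: the audio file plus a model name.
function buildTranscriptionForm(audio: Blob, filename: string): FormData {
  const form = new FormData();
  form.append("file", audio, filename); // the filename extension signals the audio format
  form.append("model", "whisper-1");
  return form;
}

// Send the audio and return the transcript text.
// `apiKey` is supplied by the caller; load it from your own secret store.
async function transcribeAudio(audio: Blob, filename: string, apiKey: string): Promise<string> {
  const response = await fetch(TRANSCRIBE_URL, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: buildTranscriptionForm(audio, filename),
  });
  if (!response.ok) {
    throw new Error(`Transcription failed: ${response.status} ${await response.text()}`);
  }
  const data = (await response.json()) as { text: string };
  return data.text;
}
```

Keeping the form construction separate from the network call makes the request shape easy to unit test without hitting the API.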
With Timestamps
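One way to get timestamps is to request the `verbose_json` response format with segment-level granularity. The sketch below builds that request and formats the returned segments; the `TranscriptSegment` shape lists only the fields used here, and the helper names are hypothetical.

```typescript
// Only the segment fields this sketch reads from a verbose_json response.
interface TranscriptSegment {
  start: number; // seconds from the beginning of the audio
  end: number;
  text: string;
}

// Same form as a basic transcription, plus the options that enable timestamps.
function buildTimestampForm(audio: Blob, filename: string): FormData {
  const form = new FormData();
  form.append("file", audio, filename);
  form.append("model", "whisper-1");
  form.append("response_format", "verbose_json");
  form.append("timestamp_granularities[]", "segment");
  return form;
}

// Convert seconds to an mm:ss label, dropping fractions of a second.
function formatClock(seconds: number): string {
  const whole = Math.floor(seconds);
  const mm = String(Math.floor(whole / 60)).padStart(2, "0");
  const ss = String(whole % 60).padStart(2, "0");
  return `${mm}:${ss}`;
}

// Render each segment as "[start - end] text" for captions or logs.
function formatSegments(segments: TranscriptSegment[]): string[] {
  return segments.map(
    (segment) => `[${formatClock(segment.start)} - ${formatClock(segment.end)}] ${segment.text.trim()}`
  );
}
```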
Language Detection
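Whisper detects the spoken language when none is specified, and the `verbose_json` response includes a `language` field. The sketch below omits the `language` parameter unless a hint is given, then maps the detected value onto a list of languages the app supports. The exact strings returned in `language` should be checked against real responses; `pickReplyLanguage` and its arguments are illustrative.

```typescript
// Only the fields this sketch reads from a verbose_json response.
interface VerboseTranscription {
  text: string;
  language: string;
}

// Leave `language` out to let Whisper auto-detect; pass a hint to skip detection.
function buildDetectionForm(audio: Blob, filename: string, languageHint?: string): FormData {
  const form = new FormData();
  form.append("file", audio, filename);
  form.append("model", "whisper-1");
  form.append("response_format", "verbose_json");
  if (languageHint) form.append("language", languageHint);
  return form;
}

// Choose the language to reply in, falling back when the detected one is unsupported.
function pickReplyLanguage(result: VerboseTranscription, supported: string[], fallback: string): string {
  const detected = result.language.trim().toLowerCase();
  return supported.includes(detected) ? detected : fallback;
}
```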
Streaming from Browser
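A hedged sketch of the browser side: record with `MediaRecorder` in short chunks, assemble them into one clip, and upload it to your own server for transcription. `RecorderLike` is a hypothetical interface covering just the `MediaRecorder` members used, which keeps the helper testable outside a browser; the `/api/transcribe` route in the usage comment is also an assumption. This uploads the assembled clip once recording stops; sending chunks as they arrive would need a different transport.

```typescript
// Container types to try, in order of preference.
const PREFERRED_MIME_TYPES = ["audio/webm;codecs=opus", "audio/webm", "audio/mp4"];

// Return the first type the current browser can record, if any.
function pickMimeType(isSupported: (mimeType: string) => boolean): string | undefined {
  return PREFERRED_MIME_TYPES.find(isSupported);
}

// The subset of the browser MediaRecorder interface this sketch relies on.
interface RecorderLike {
  mimeType: string;
  ondataavailable: ((event: any) => void) | null;
  onstop: ((event: any) => void) | null;
  start(timesliceMs?: number): void;
  stop(): void;
}

// Record for `durationMs`, collecting chunks, and resolve with the assembled clip.
function recordClip(recorder: RecorderLike, durationMs: number): Promise<Blob> {
  return new Promise((resolve) => {
    const chunks: Blob[] = [];
    recorder.ondataavailable = (event) => {
      if (event.data && event.data.size > 0) chunks.push(event.data);
    };
    recorder.onstop = () => resolve(new Blob(chunks, { type: recorder.mimeType }));
    recorder.start(250); // emit a chunk roughly every 250 ms
    setTimeout(() => recorder.stop(), durationMs);
  });
}

// Browser usage (not executed here):
//   const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
//   const mimeType = pickMimeType((t) => MediaRecorder.isTypeSupported(t));
//   const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
//   const clip = await recordClip(recorder, 5000);
//   const form = new FormData();
//   form.append("audio", clip, "clip");
//   await fetch("/api/transcribe", { method: "POST", body: form }); // hypothetical route
```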
Text-to-Speech with TTS
Basic Speech Generation
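A minimal sketch of speech generation against the REST endpoint, assuming global `fetch`. The request body carries a model, a voice, and the input text; the 4096-character input cap reflects the documented limit at the time of writing and is worth re-checking. Helper names are illustrative.

```typescript
// REST endpoint for OpenAI text-to-speech.
const SPEECH_URL = "https://api.openai.com/v1/audio/speech";

// Assumed per-request input limit; verify against the current API reference.
const MAX_TTS_INPUT_CHARS = 4096;

interface SpeechRequest {
  model: "tts-1" | "tts-1-hd";
  voice: string;
  input: string;
}

// Validate the text and build the JSON body for a standard-quality request.
function buildSpeechRequest(text: string, voice: string = "alloy"): SpeechRequest {
  const input = text.trim();
  if (input.length === 0) throw new Error("TTS input is empty");
  if (input.length > MAX_TTS_INPUT_CHARS) {
    throw new Error(`TTS input exceeds ${MAX_TTS_INPUT_CHARS} characters`);
  }
  return { model: "tts-1", voice, input };
}

// Request speech and return the encoded audio bytes (MP3 unless another format is requested).
async function synthesizeSpeech(text: string, apiKey: string): Promise<ArrayBuffer> {
  const response = await fetch(SPEECH_URL, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify(buildSpeechRequest(text)),
  });
  if (!response.ok) {
    throw new Error(`Speech generation failed: ${response.status} ${await response.text()}`);
  }
  return response.arrayBuffer();
}
```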
Voice Options
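The sketch below lists the six voices originally offered for `tts-1` and `tts-1-hd` and guards against unknown values before a request is sent. Newer voices may exist, so treat the list as a starting point; `resolveVoice` is a hypothetical helper.

```typescript
// Voices originally available for tts-1 and tts-1-hd.
const TTS_VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"] as const;
type TtsVoice = (typeof TTS_VOICES)[number];

// Type guard: narrows an arbitrary string to a known voice.
function isTtsVoice(value: string): value is TtsVoice {
  return (TTS_VOICES as readonly string[]).includes(value);
}

// Use the requested voice when it is valid, otherwise fall back to a default.
function resolveVoice(requested: string | undefined, fallback: TtsVoice = "alloy"): TtsVoice {
  if (requested !== undefined && isTtsVoice(requested)) return requested;
  return fallback;
}
```

Resolving the voice on the server means a stale client setting degrades to the default voice instead of producing an API error.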
Streaming TTS
For real-time playback, stream the audio as it is generated rather than waiting for the complete file.
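A sketch of streamed playback, assuming global `fetch` with a readable response body. `pumpStream` forwards each chunk to a callback as it arrives; what the callback does (append to a `MediaSource`, write to an HTTP response, and so on) is left to the caller. Function names are illustrative.

```typescript
// Read a byte stream chunk by chunk, handing each chunk to `onChunk`.
// Resolves with the total number of bytes seen.
async function pumpStream(
  body: ReadableStream<Uint8Array>,
  onChunk: (chunk: Uint8Array) => void
): Promise<number> {
  const reader = body.getReader();
  let totalBytes = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    if (value) {
      totalBytes += value.byteLength;
      onChunk(value);
    }
  }
  return totalBytes;
}

// Request speech and forward audio chunks as soon as they are received.
async function streamSpeech(
  text: string,
  apiKey: string,
  onChunk: (chunk: Uint8Array) => void
): Promise<number> {
  const response = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify({ model: "tts-1", voice: "alloy", input: text }),
  });
  if (!response.ok || !response.body) {
    throw new Error(`Streaming speech failed: ${response.status}`);
  }
  return pumpStream(response.body as ReadableStream<Uint8Array>, onChunk);
}
```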
Complete Voice Agent
Putting it all together: transcribe the user's audio, generate a reply with an LLM, and speak the reply.
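The three stages can be sketched as a small class with injected functions, so the orchestration is independent of any particular SDK. `VoiceStages`, `ChatTurn`, and `VoiceAgent` are hypothetical names; in practice `transcribe` would call Whisper, `reply` an LLM, and `speak` the TTS endpoint.

```typescript
interface ChatTurn {
  role: "user" | "assistant";
  content: string;
}

// The three stages of the pipeline, supplied by the caller.
interface VoiceStages {
  transcribe(audio: Blob): Promise<string>;
  reply(history: ChatTurn[]): Promise<string>;
  speak(text: string): Promise<ArrayBuffer>;
}

interface VoiceTurnResult {
  heard: string;
  said: string;
  audio: ArrayBuffer | null;
}

class VoiceAgent {
  private history: ChatTurn[] = [];
  private stages: VoiceStages;
  private maxTurns: number;

  constructor(stages: VoiceStages, maxTurns: number = 20) {
    this.stages = stages;
    this.maxTurns = maxTurns;
  }

  // Run one full turn: speech in, speech out.
  async handleAudio(audio: Blob): Promise<VoiceTurnResult> {
    const heard = (await this.stages.transcribe(audio)).trim();
    // Skip the LLM and TTS calls when nothing was said.
    if (heard.length === 0) return { heard: "", said: "", audio: null };

    this.history.push({ role: "user", content: heard });
    const said = await this.stages.reply([...this.history]);
    this.history.push({ role: "assistant", content: said });
    // Keep only the most recent turns so the prompt does not grow without bound.
    this.history = this.history.slice(-this.maxTurns);

    return { heard, said, audio: await this.stages.speak(said) };
  }
}
```

Injecting the stages also makes it straightforward to swap in a fallback for any one of them when an API call fails.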
Real-Time Conversations
For true real-time conversation, use OpenAI's Realtime API, which handles simultaneous audio input and output with server-side voice activity detection.
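A heavily hedged sketch of the client events involved, not a full client. It assumes the WebSocket flavor of the Realtime API, the `gpt-4o-realtime-preview` model name, and the `session.update` and `input_audio_buffer.append` event shapes from the beta documentation; event names and audio formats have changed between versions, so check the current reference before relying on any of this.

```typescript
// Assumed WebSocket URL and model name; confirm both against current documentation.
const REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview";

// Configure the session and ask the server to detect turn boundaries itself.
function sessionUpdateEvent(instructions: string): string {
  return JSON.stringify({
    type: "session.update",
    session: { instructions, turn_detection: { type: "server_vad" } },
  });
}

// Base64-encode raw bytes using the btoa global (browsers and recent Node).
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Wrap one chunk of microphone audio (assumed 16-bit PCM) for sending.
function appendAudioEvent(pcm16Chunk: Uint8Array): string {
  return JSON.stringify({ type: "input_audio_buffer.append", audio: toBase64(pcm16Chunk) });
}

// Usage with an already-authenticated socket (not executed here):
//   socket.send(sessionUpdateEvent("You are a concise voice assistant."));
//   socket.send(appendAudioEvent(chunk));
```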
Mobile Integration (iOS)
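We built SheGPT with voice in 1 day. The Swift pattern is to record with AVAudioEngine, send the audio for transcription, and stream the response back so users get immediate audio feedback.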
Cost Optimization
Whisper Costs
- $0.006 per minute of audio
- 10 hours/day = 600 minutes = ~$3.60/day
TTS Costs
| Model | Per 1M characters |
|---|---|
| tts-1 | $15 |
| tts-1-hd | $30 |
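The listed rates translate into a small cost estimator. This sketch hard-codes the prices quoted above, which may change; the function names are illustrative. For example, 10 hours of audio is 600 minutes, which comes to about $3.60 at $0.006 per minute.

```typescript
// Prices quoted in this article; update them if the published rates change.
const WHISPER_USD_PER_MINUTE = 0.006;
const TTS_USD_PER_MILLION_CHARS = { "tts-1": 15, "tts-1-hd": 30 } as const;
type TtsModel = keyof typeof TTS_USD_PER_MILLION_CHARS;

// Cost of transcribing `minutes` of audio.
function whisperCostUsd(minutes: number): number {
  return minutes * WHISPER_USD_PER_MINUTE;
}

// Cost of synthesizing `characters` of text with the given model.
function ttsCostUsd(characters: number, model: TtsModel = "tts-1"): number {
  return (characters / 1e6) * TTS_USD_PER_MILLION_CHARS[model];
}

// Combined daily estimate for a given usage profile.
function dailyVoiceCostUsd(audioMinutes: number, ttsCharacters: number, model: TtsModel = "tts-1"): number {
  return whisperCostUsd(audioMinutes) + ttsCostUsd(ttsCharacters, model);
}
```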
Optimization Strategies
- Client-side VAD: Only send audio when speech is detected
- Compress audio: Send a compressed format such as mp3 or webm instead of uncompressed wav; Whisper accepts both
- Cache common responses: TTS the same phrases? Cache them
- Use tts-1 for most cases: HD only for premium features
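Client-side VAD can be approximated with a simple energy gate, sketched below. This is a basic RMS-threshold check, not a trained VAD model, and the threshold and frame counts are assumptions to tune against your own microphone input. Frames are expected as `Float32Array` samples in the range -1 to 1, as produced by the Web Audio API.

```typescript
// Root-mean-square level of one audio frame.
function rmsLevel(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) sumSquares += frame[i] * frame[i];
  return Math.sqrt(sumSquares / frame.length);
}

// Decide whether a recording contains speech: require several consecutive
// frames above the threshold so brief clicks do not trigger an upload.
function containsSpeech(
  frames: Float32Array[],
  threshold: number = 0.02,
  minConsecutiveFrames: number = 3
): boolean {
  let run = 0;
  for (const frame of frames) {
    run = rmsLevel(frame) >= threshold ? run + 1 : 0;
    if (run >= minConsecutiveFrames) return true;
  }
  return false;
}
```

Calling `containsSpeech` before uploading lets the client drop silent recordings and avoid paying for their transcription.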
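Caching repeated phrases can be sketched as an in-memory map keyed by voice and normalized text. `TtsCache` and its `Synthesize` callback are hypothetical; a production version might persist audio to disk or object storage instead. Storing the pending promise means concurrent requests for the same phrase share one API call.

```typescript
// The function that actually generates audio, supplied by the caller.
type Synthesize = (text: string, voice: string) => Promise<ArrayBuffer>;

class TtsCache {
  private entries = new Map<string, Promise<ArrayBuffer>>();
  private synthesize: Synthesize;
  private maxEntries: number;

  constructor(synthesize: Synthesize, maxEntries: number = 500) {
    this.synthesize = synthesize;
    this.maxEntries = maxEntries;
  }

  // Normalize whitespace so trivially different inputs share an entry.
  static keyFor(text: string, voice: string): string {
    return `${voice}:${text.trim().replace(/\s+/g, " ")}`;
  }

  get(text: string, voice: string): Promise<ArrayBuffer> {
    const key = TtsCache.keyFor(text, voice);
    const cached = this.entries.get(key);
    if (cached) return cached;

    const pending = this.synthesize(text, voice);
    this.entries.set(key, pending);
    // Do not keep failed requests in the cache.
    pending.catch(() => this.entries.delete(key));

    // Evict the oldest entry (Map preserves insertion order) when over capacity.
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    return pending;
  }
}
```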
Production Checklist
- Audio format validation (Whisper supports: mp3, wav, webm, etc.)
- File size limits (max 25MB for Whisper)
- Timeout handling for long audio
- Graceful degradation when APIs fail
- Rate limiting per user
- Cost monitoring and alerts
- Logging for debugging
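The first two checklist items can be sketched as a pre-upload validator. The extension list matches the formats named in this article, and the 25MB limit is interpreted here as 25 × 1024 × 1024 bytes, which is an assumption; `validateAudioUpload` is an illustrative name.

```typescript
// Formats this article lists as supported by Whisper.
const SUPPORTED_AUDIO_EXTENSIONS = ["mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"];

// Assumed byte interpretation of the 25MB limit.
const MAX_AUDIO_BYTES = 25 * 1024 * 1024;

// Return an error message when the upload should be rejected, or null when it is acceptable.
function validateAudioUpload(filename: string, sizeBytes: number): string | null {
  const dot = filename.lastIndexOf(".");
  const extension = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : "";
  if (!SUPPORTED_AUDIO_EXTENSIONS.includes(extension)) {
    return `Unsupported audio format: ${extension || "(none)"}`;
  }
  if (sizeBytes <= 0) return "Audio file is empty";
  if (sizeBytes > MAX_AUDIO_BYTES) return "Audio file exceeds the 25MB limit";
  return null;
}
```

Rejecting bad uploads before calling the API avoids paying for requests that would fail and gives the user a clearer error.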
Frequently Asked Questions
Q: How much does it cost to build a voice AI application with OpenAI?
Whisper speech-to-text costs $0.006 per minute of audio, and TTS (text-to-speech) costs $15 per million characters for standard quality or $30 for HD. For a voice app processing 10 hours (600 minutes) of audio daily, Whisper costs about $3.60/day. The total cost depends on usage volume, but a consumer app with moderate traffic can run voice features for $200-500/month in API costs.
Q: What is the difference between OpenAI Whisper and the Realtime API for voice?
A Whisper-based setup is an asynchronous, turn-based pipeline: audio is recorded, sent for transcription, processed by an LLM, and then converted back to speech via TTS. The Realtime API enables true real-time conversation with simultaneous audio input and output, server-side voice activity detection, and sub-second latency. The Whisper pipeline is cheaper and simpler; the Realtime API delivers a more natural conversational experience.
Q: What audio formats does OpenAI Whisper support?
Whisper supports mp3, mp4, mpeg, mpga, m4a, wav, and webm formats with a maximum file size of 25MB. For browser-based applications, WebM is the most common format from MediaRecorder. For mobile apps, MP4 or M4A from the device microphone works well. Whisper also automatically detects the spoken language if you do not specify one.
Q: How do you optimize voice AI for mobile apps?
Key optimizations include client-side voice activity detection (VAD) so you only send audio when speech is detected, compressing audio before transmission, caching common TTS responses to avoid regenerating the same phrases, and using the standard tts-1 model instead of tts-1-hd for most interactions. On iOS, the AVAudioEngine framework handles recording efficiently, and streaming responses back gives users immediate audio feedback.
Building a Voice App?
We specialize in voice AI applications.
AI 4U Labs builds production voice AI. SheGPT shipped in 1 day with full voice support.


