Speech-to-Text (STT)

AI technology that converts spoken audio into written text, enabling voice input, transcription, and voice-controlled interfaces.

How It Works

Speech-to-text, also called automatic speech recognition (ASR), converts audio into text. OpenAI's Whisper is one of the most widely used models: it supports 90+ languages, handles accents and background noise well, and is available both through an API and as an open-source model you can self-host. Deepgram and AssemblyAI offer specialized STT with features such as speaker diarization (identifying who said what) and real-time streaming transcription.

For mobile apps, Apple's Speech framework and Android's SpeechRecognizer provide built-in STT with no per-request API cost. On supported devices and languages, recognition can run entirely on-device with no internet connection. These options are ideal for simple voice input but are generally less accurate than cloud models for complex or multilingual audio.

In production, STT is commonly paired with an LLM and text-to-speech (TTS) to create voice assistants: speech goes in (STT), the transcript is processed by the LLM, and the response is spoken aloud (TTS).

Key considerations: real-time vs. batch transcription, language support, handling of domain-specific terminology, and whether on-device or cloud processing better fits your privacy and latency requirements.
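
The STT, LLM, and TTS flow described above can be sketched as three pluggable stages. This is a minimal illustration, not any vendor's API: run_voice_turn, VoiceTurn, and the fake_* stages are hypothetical names invented for this example, and the stand-in stages only pass text through so the sketch runs without network access or API keys. In a real app, each stage would wrap whichever STT, LLM, and TTS service you choose.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions for this sketch); real STT, LLM,
# and TTS clients would be wrapped in functions that match them.
Transcriber = Callable[[bytes], str]   # audio bytes -> transcript
Responder = Callable[[str], str]       # transcript -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio bytes


@dataclass
class VoiceTurn:
    transcript: str
    reply_text: str
    reply_audio: bytes


def run_voice_turn(
    audio: bytes, stt: Transcriber, llm: Responder, tts: Synthesizer
) -> VoiceTurn:
    """Run one voice-assistant turn: STT, then LLM, then TTS."""
    transcript = stt(audio).strip()
    if not transcript:
        # Nothing was recognized (silence or noise), so skip the LLM and TTS calls.
        return VoiceTurn(transcript="", reply_text="", reply_audio=b"")
    reply_text = llm(transcript)
    reply_audio = tts(reply_text)
    return VoiceTurn(transcript, reply_text, reply_audio)


# Stand-in stages: they do no real recognition or synthesis.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretends the audio bytes are already text


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretends encoded text is synthesized audio


if __name__ == "__main__":
    turn = run_voice_turn(b"turn on the lights", fake_stt, fake_llm, fake_tts)
    print(turn.transcript)
    print(turn.reply_text)
```

Keeping each stage behind a plain function makes it straightforward to swap an on-device recognizer for a cloud one, or to test the pipeline without calling any external service.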

Common Use Cases

  • Voice input for AI assistants
  • Meeting transcription
  • Podcast and video captioning
  • Voice search
  • Dictation and note-taking
