
Tencent Covo-Audio: Open Source 7B Parameter Speech Language Model for Real-Time AI

Tencent's Covo-Audio is a 7B parameter open-source speech language model that unifies real-time audio input and output for next-gen conversational AI.


Tencent shook up speech AI with Covo-Audio, a 7 billion parameter Large Audio Language Model (LALM) that blends speech-to-text, audio understanding, dialogue management, and text-to-speech into one smooth, unified system. This isn’t just a research demo locked away—it’s open-source and designed for real-time, full-duplex conversations that handle backchannels, interruptions, and natural turn-taking.

We’ve been working with large-scale speech and language systems for years, shipping apps used by millions. Covo-Audio’s all-in-one design slashes latency and cuts down on complexity compared to stitching together separate STT/NLP/TTS pipelines. Something else Tencent nailed is an intelligence-speaker decoupling method that makes swapping voices easy without retraining expensive TTS models—that’s a huge win for saving costs and scaling.

If your product involves real-time spoken dialogue or you're hunting for one model to cover both audio input and output, you need to check out Covo-Audio.


What is Tencent Covo-Audio?

Tencent Covo-Audio is a 7 billion parameter Large Audio Language Model that processes and generates audio end-to-end. Rather than breaking down speech-to-text, NLP, then text-to-speech as separate steps, Covo-Audio does it all in one shot. It takes raw audio and outputs natural spoken replies—no intermediate text needed.

Here’s what it does:

  • Transcribes speech
  • Understands dialogue context
  • Creates empathetic, context-aware spoken responses
  • Manages conversations in full duplex, meaning it listens and talks at the same time
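To make the contrast with pipeline systems concrete, here is a minimal Python sketch of what an audio-in, audio-out interface looks like. The names `UnifiedSpeechModel` and `SpokenReply` are illustrative stand-ins, not Covo-Audio's actual API:

```python
# Hypothetical sketch: a unified model exposes one call, audio in -> audio out.
# No separate STT, NLP, or TTS stages in between.

from dataclasses import dataclass

@dataclass
class SpokenReply:
    audio: bytes        # synthesized speech waveform
    transcript: str     # optional text trace for logging

class UnifiedSpeechModel:
    """Stand-in for an end-to-end LALM."""
    def respond(self, audio_chunk: bytes) -> SpokenReply:
        # A real model decodes audio tokens directly; we mock the result.
        return SpokenReply(audio=b"\x00" * len(audio_chunk),
                           transcript="(mock reply)")

model = UnifiedSpeechModel()
reply = model.respond(b"\x01\x02\x03")
print(type(reply.audio), reply.transcript)
```

The point of the shape, not the mock: there is exactly one call site, so there is no glue code deciding where transcription ends and generation begins.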

Tencent AI Lab open-sourced the core model and plans to release the inference pipeline soon (as of Feb 2026). It’s early days for community use, but this will have a big impact.


Breaking Down the 7B Parameter Model

Tencent didn’t just throw more parameters at the problem. The 7 billion size hits a sweet spot between performance and efficiency:

  • Outperforms comparable open-source speech language models in the same size range (per Tencent’s arXiv paper)
  • Runs quickly on GPU clusters, perfect for real-time streaming conversations
  • Handles turn-taking, pauses, and interruptions gracefully thanks to built-in full-duplex capabilities—unlike typical one-turn voice assistants

From a developer perspective, that means smoother user experiences and no messy hacks to decide when to listen or when to speak.

Key Features

| Feature | Details |
| --- | --- |
| Unified Audio Model | Handles audio input and output end-to-end with no separate STT or TTS components |
| Full-Duplex Dialogue | Listens and speaks simultaneously with natural turn-taking and backchannel signals |
| Speech & Audio Understanding | Captures context, emotion, and instructions all within the audio stream |
| Intelligence-Speaker Decoupling | Separates dialogue intelligence from voice rendering for easy voice customization |

Tencent’s intelligence-speaker decoupling deserves special mention. By splitting the AI logic from voice rendering, you only need minimal TTS data for new voices. This drastically cuts the time and cost for businesses wanting branded voices without retraining huge TTS models.


Open-Source Release & Inference Pipeline

Tencent has released pretrained weights, with the inference pipeline to follow, centered on the chat-oriented Covo-Audio-Chat variant.

This radically simplifies building voice assistants, customer service bots, or car AI.

Here’s a Python sketch of a continuous audio conversation loop (the embedded example did not load, so this is illustrative — class and method names like `CovoAudioChat.stream` are assumptions, not the released API):

```python
# Illustrative streaming loop: one model handles listening and replying
# end-to-end. Names are hypothetical; substitute the real Covo-Audio-Chat
# bindings once the inference pipeline ships.

class CovoAudioChat:
    """Mock stand-in for the unified chat model."""
    def stream(self, mic_chunks):
        for chunk in mic_chunks:
            # A real model emits reply audio incrementally; we echo a mock.
            yield b"reply:" + chunk

def microphone():
    """Mock microphone feed: three short audio chunks."""
    yield from (b"hello", b"how are you", b"bye")

model = CovoAudioChat()
for reply_audio in model.stream(microphone()):
    # Play reply_audio on the speaker as soon as it arrives.
    print(len(reply_audio), "bytes of reply audio")
```

This single loop runs the entire conversation, no juggling different models required. Tencent’s benchmarks show sub-second latency, putting Covo-Audio on par with commercial voice assistants.

The full-duplex variant, Covo-Audio-Chat-FD, adds turn-taking smarts to simulate natural dialog flow. It detects pauses, interruptions, even overlapping speech, pausing its output so it doesn’t talk over you. This solves common UX headaches in voice AI.
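The barge-in behavior described above can be sketched as a tiny state machine. This is my simplification of the described behavior, not Covo-Audio-Chat-FD's internal logic:

```python
# Simplified overlap/barge-in handler: the agent yields the floor when
# user speech is detected, instead of finishing its utterance first.

def handle_frame(agent_state: str, user_energy: float,
                 threshold: float = 0.5) -> str:
    """Return the next agent state given one frame of user audio energy."""
    user_talking = user_energy > threshold
    if agent_state == "speaking" and user_talking:
        return "yielding"      # barge-in: stop output, keep decoding input
    if agent_state == "yielding" and not user_talking:
        return "speaking"      # user finished: resume or respond
    return agent_state

state = "speaking"
for energy in [0.1, 0.8, 0.9, 0.2]:   # user interrupts mid-utterance
    state = handle_frame(state, energy)
print(state)
```

A production system would replace the energy threshold with a learned voice-activity signal, but the state transitions capture the UX fix: the assistant stops talking the moment you do.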


Where Covo-Audio Excels in Real Life

It’s a great fit wherever natural, responsive voice interaction matters:

  • AI Customer Support: Agents that respond fluidly, without robotic lags or awkward silences
  • Smart Home & IoT: Devices that jump in or stop talking naturally, making conversations feel less scripted
  • Car Assistants: Handles noisy environments with fast, empathetic replies and smooth turn-taking
  • Healthcare Virtual Assistants: Maintains sensitive conversations with empathy and context awareness

Tencent’s decoupling approach also lets you roll out new brand voices with minimal TTS data—key for scaling voice services to millions.


How Covo-Audio Compares

Covo-Audio contrasts sharply with pipelines that still stitch together ASR, LLM/NLP, and TTS as separate models.

| Model/Approach | Parameters | Pipeline Complexity | Real-Time Full-Duplex | Voice Customization Cost | Latency | Tradeoff |
| --- | --- | --- | --- | --- | --- | --- |
| Tencent Covo-Audio | 7B | Single unified model | Yes | Low (due to decoupling) | <1 second | GPU intensive |
| OpenAI Whisper + GPT + TTS | ~130M + >100B + separate TTS | High (multiple models) | No | High (retrain for each voice) | 2-3 seconds | Higher latency & complexity |
| Meta Speech LLMs | 10B+ | Partial unification | Limited | Medium | 1-2 seconds | Less mature full-duplex |
| Proprietary Voice Assistants | Varies | End-to-end closed system | Yes | Variable | <1 second | Not open source |

Tencent’s paper (arxiv.org) shows Covo-Audio outperforms same-size open models across key metrics—like instruction-following and empathetic responses.

At AI 4U Labs, we use unified pipelines in production apps with 100k+ daily users. Collapsing components cut latency from ~2.5s to under one second. Covo-Audio fits this trend perfectly.


Business Impact: Cost and Deployment

Here’s a quick reality check on production use:

  • Compute Costs: Running a 7B model costs roughly $0.15–0.25 per hour on Nvidia A100/H100 GPUs. Real-time inference latency hits 400-700 ms per audio chunk—manageable for interactive apps.

  • Voice Customization: Thanks to decoupling, you only need minutes of TTS data per voice instead of hours—cutting data labeling and collection by over 80%.

  • Dev Time: Removing fragile orchestration between separate STT, NLP, and TTS cuts integration and testing time by 30–50%.
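A quick back-of-envelope check on the compute figures above (the streams-per-GPU number is an assumption; real pricing and utilization vary by provider and load):

```python
# Rough monthly cost sketch using the per-hour figures cited above.

gpu_cost_per_hour = 0.20          # midpoint of the $0.15-0.25 range
hours_per_month = 24 * 30
concurrent_streams_per_gpu = 10   # assumed; depends on batching and chunk size

monthly_cost_per_gpu = gpu_cost_per_hour * hours_per_month
cost_per_stream = monthly_cost_per_gpu / concurrent_streams_per_gpu

print(f"${monthly_cost_per_gpu:.0f}/month per GPU, "
      f"${cost_per_stream:.2f}/month per always-on stream")
```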

A typical AI 4U Labs client chatbot with 100k monthly users pays about $5k–7k/month on inference, running on a small GPU cluster with scaling for peak loads.

For startups and midsize players, this hits a sweet spot with near-human voice interactions at reasonable cloud costs.


Code Example: Customizing Voices with Decoupled TTS

Tencent’s intelligence-speaker split makes switching voices easy without touching the dialogue AI. Here’s a sketch of how you might layer a lightweight TTS engine on top (the embedded example did not load; the class names below are hypothetical, not Tencent’s API):

```python
# Illustrative voice-swap sketch: the dialogue model emits a voice-agnostic
# response, and a small, swappable TTS adapter renders it per brand voice.

class DialogueBrain:
    """Voice-agnostic dialogue intelligence (mocked)."""
    def reply(self, user_audio: bytes) -> str:
        return "Happy to help with that."

class VoiceAdapter:
    """Lightweight TTS head, trainable from minutes of speaker data."""
    def __init__(self, voice_id: str):
        self.voice_id = voice_id
    def synthesize(self, text: str) -> bytes:
        return f"[{self.voice_id}] {text}".encode()

brain = DialogueBrain()
for voice in ("brand_en_female", "brand_en_male"):
    audio = VoiceAdapter(voice).synthesize(brain.reply(b"..."))
    print(audio)  # same dialogue brain, different rendered voice
```

The takeaway: your main AI conversation brain stays voice-agnostic, so customizing voices is fast and cheap.


What This Means for Speech AI

Tencent’s Covo-Audio takes a big leap toward owning spoken language AI with one model. If you’ve wrestled with latency and integration headaches from separate STT, NLP, and TTS parts, this feels like a breath of fresh air.

Their full-duplex approach finally tackles the age-old problem of voice assistants talking over you or freezing up. The decoupling hints at a future where scaling brand voices doesn’t require mountains of data.

Covo-Audio signals a move from monolithic LLMs to unified multimodal conversational models—a key development as voice interfaces take center stage.


How to Start Using Covo-Audio

If you’re building chatbots, voice assistants, or anything with real-time spoken dialogue, here’s how to jump in:

  1. Prototype with the open-source weights — Tencent’s release lets you experiment without building from scratch.
  2. Benchmark latency on your hardware — Nvidia A100 or newer GPUs offer the best chance for smooth sub-second replies.
  3. Keep voice customization separate — use the decoupling approach to scale brand voices affordably.
  4. Integrate with existing systems only if needed — Covo-Audio’s unified pipeline reduces complexity on its own.
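For the latency benchmark in step 2, a minimal harness might look like this (replace `fake_respond` with the real inference call on your GPU; the function name is a placeholder):

```python
# Sketch of a p95 latency benchmark for a single audio chunk.

import time

def fake_respond(chunk: bytes) -> bytes:
    return chunk  # stand-in for model inference

def p95_latency_ms(respond, chunk: bytes, runs: int = 50) -> float:
    """Time `respond` over `runs` calls and return the 95th-percentile ms."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        respond(chunk)
        samples.append((time.perf_counter() - t0) * 1000)
    samples.sort()
    return samples[int(0.95 * (len(samples) - 1))]

chunk = b"\x00" * 16000   # ~1 s of 16 kHz 8-bit mono audio
print(f"p95: {p95_latency_ms(fake_respond, chunk):.2f} ms")
```

Measuring p95 rather than the mean matters for voice UX: users notice the occasional slow reply far more than the average one.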

At AI 4U Labs, we're already testing Covo-Audio in healthcare and automotive AI projects. It’s too soon to say it replaces every existing pipeline but it’s definitely the direction worth betting on.


Definitions

Large Audio Language Model (LALM): A model that handles spoken audio input and generates audio output, covering speech recognition, understanding, and synthesis all in one.

Full-Duplex Audio Conversation: The system can listen and talk at the same time, naturally managing conversation turns instead of forcing rigid listen-then-speak cycles.

Intelligence-Speaker Decoupling: A design where the dialogue intelligence (understanding and generating responses) is separate from the voice rendering, making voice customization easier and cheaper.


FAQ

How does Tencent Covo-Audio differ from typical speech-to-text plus LLM setups?

It merges speech-to-text, dialogue understanding, and speech synthesis into a single 7B parameter model that runs end-to-end. This reduces latency and complexity compared to chaining STT, LLM, and TTS separately.

Is the full-duplex conversational ability more expensive to run?

Not necessarily. Tencent’s optimizations keep costs around $0.15–0.25/hour on modern GPUs. It’s practical for scalable deployments.

Can voices be customized without retraining the entire model?

Absolutely. With Tencent’s decoupling, you can swap out TTS voices using small datasets, without retraining the dialogue model, saving time and money.

When will the open-source inference pipeline be available?

Tencent AI Lab plans to release the Covo-Audio-Chat model and inference pipeline soon, as of early 2026. Check their GitHub and AI Lab site for updates.


Building something with Tencent Covo-Audio? AI 4U Labs moves from prototype to production in 2-4 weeks.

Sources:

  • Tencent AI Lab arxiv.org report on Covo-Audio
  • OpenAI pricing page and latency benchmarks
  • AI 4U Labs internal benchmarks and cost analysis

Topics

Tencent Covo-Audio · open source speech model · 7B parameter speech language model · real-time audio AI · audio conversation AI
