Introducing Qwen3-TTS: A Breakthrough Open Multilingual Text-to-Speech Suite


What Is Qwen3-TTS?

Qwen3-TTS, from Alibaba Cloud’s Qwen team, is a family of multilingual text-to-speech (TTS) models designed for three primary tasks: voice cloning, voice design, and high-quality speech generation. By open-sourcing this suite, the team has made it accessible for developers and researchers to explore advanced voice technology. This initiative not only democratizes access to sophisticated voice synthesis tools but also encourages innovation across sectors such as entertainment, education, and customer service.

Key Features of Qwen3-TTS

Model Family and Functionalities

The Qwen3-TTS suite pairs a 12Hz speech tokenizer with two language-model sizes, 0.6B and 1.7B, and ships five distinct models, each serving a different purpose:

  • Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base: Ideal for voice cloning and standard TTS.
  • Qwen3-TTS-12Hz-0.6B-CustomVoice and Qwen3-TTS-12Hz-1.7B-CustomVoice: Tailored for promptable preset speakers.
  • Qwen3-TTS-12Hz-1.7B-VoiceDesign: Allows users to create freeform voices based on natural language descriptions.
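To make the task-to-checkpoint mapping above concrete, here is a minimal sketch. The checkpoint names follow the family naming listed in this section; the helper function itself is an illustration, not part of any official Qwen3-TTS API.

```python
# Hypothetical helper mapping a task to the matching Qwen3-TTS checkpoint.
# Checkpoint names follow the model family above; the function is illustrative.
CHECKPOINTS = {
    ("clone", "0.6B"): "Qwen3-TTS-12Hz-0.6B-Base",
    ("clone", "1.7B"): "Qwen3-TTS-12Hz-1.7B-Base",
    ("preset", "0.6B"): "Qwen3-TTS-12Hz-0.6B-CustomVoice",
    ("preset", "1.7B"): "Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ("design", "1.7B"): "Qwen3-TTS-12Hz-1.7B-VoiceDesign",  # no 0.6B variant
}

def pick_checkpoint(task: str, size: str = "1.7B") -> str:
    """Return the checkpoint name for a task: 'clone', 'preset', or 'design'."""
    try:
        return CHECKPOINTS[(task, size)]
    except KeyError:
        raise ValueError(f"No {size} checkpoint for task {task!r}")

print(pick_checkpoint("design"))  # Qwen3-TTS-12Hz-1.7B-VoiceDesign
```

Note that VoiceDesign only exists at the 1.7B size, so the lookup fails loudly for a 0.6B design request rather than silently substituting a different model.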

All models support 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. The CustomVoice models offer nine curated timbres with distinct voice styles, such as Vivian (a bright young female voice from China), Ryan (a dynamic male voice from England), and Ono_Anna (a playful female voice from Japan). This range of voice options allows for personalized interactions that cater to diverse audiences.

Innovative Voice Design Capabilities

One of the standout features is the VoiceDesign model, which turns text instructions into unique voice outputs. For instance, users can specify that they want the voice to sound “like a nervous teenage boy with rising intonation.” This flexibility makes it easier for developers to create engaging and varied voice outputs for different applications, and the ability to shape voice characteristics through descriptive language opens up new avenues for storytelling, marketing, and user engagement.
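A request like the one described might be assembled as below. This is a hedged sketch only: the request fields and structure are assumptions for illustration, not the actual VoiceDesign API, so consult the official repository for the real interface.

```python
# Hypothetical VoiceDesign request. The dict fields are assumptions for
# illustration; the real API may differ -- see the official Qwen3-TTS repo.
voice_description = (
    "A nervous teenage boy with rising intonation, "
    "speaking quickly and slightly breathlessly."
)

request = {
    "model": "Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "voice_description": voice_description,  # freeform natural-language prompt
    "text": "Wait... are you sure we're allowed to be in here?",
    "language": "English",
}
print(request["model"])  # Qwen3-TTS-12Hz-1.7B-VoiceDesign
```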

Understanding the Architecture and Technology

Tokenizer and Streaming Efficiency

Qwen3-TTS employs a dual-track language model: one track predicts discrete acoustic tokens from the text, while the other manages alignment and control signals. Trained on over five million hours of multilingual speech, the model follows a three-stage pre-training approach and handles contexts of up to 32,768 tokens. This training foundation allows Qwen3-TTS to generate contextually relevant speech over long passages, which is vital for maintaining user engagement.

A key component of this system is the Qwen3-TTS-Tokenizer-12Hz codec. It operates at 12.5 frames per second, so each token covers approximately 80 milliseconds of audio. With 16 quantizers and a 2048-entry codebook, it outperforms competitors like SpeechTokenizer and Mimi on key metrics, offering low-latency streaming and high-fidelity audio output. For more details, see the original research paper. This is particularly beneficial for applications where timing and clarity are paramount, such as interactive voice response systems and real-time translation services.
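The codec numbers above imply a compact bitstream, which is worth working out explicitly. With 12.5 frames per second, 16 quantizers per frame, and an 11-bit code per quantizer (log2 of the 2048-entry codebook), the arithmetic is:

```python
import math

frame_rate = 12.5       # frames per second, as stated above
num_quantizers = 16     # quantizers per frame
codebook_size = 2048    # entries per codebook

bits_per_code = math.log2(codebook_size)            # 11 bits per code
bitrate_bps = frame_rate * num_quantizers * bits_per_code

print(1000 / frame_rate)  # 80.0 -> milliseconds of audio per frame
print(bitrate_bps)        # 2200.0 -> bits per second at full depth
```

So at full quantizer depth the codec needs only about 2.2 kbps, and each frame spans 80 ms, consistent with the per-token figure quoted above.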

Real-Time Performance

In testing, the first-packet latency for the 0.6B and 1.7B Base models was approximately 97 ms and 101 ms, respectively, even when handling multiple concurrent requests. This low latency is necessary for applications requiring real-time interaction, such as virtual assistants and customer support, where prompt responses are critical to maintaining a conversational flow.
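First-packet latency of this kind is easy to measure against any streaming backend: time from issuing the request until the first audio chunk arrives. The harness below is a generic sketch with a stand-in generator, not the Qwen3-TTS client.

```python
import time

def first_packet_latency(stream):
    """Milliseconds from request start to the first audio chunk (generic harness)."""
    start = time.perf_counter()
    first_chunk = next(stream)  # blocks until the first packet arrives
    return (time.perf_counter() - start) * 1000.0, first_chunk

# Stand-in generator simulating a streaming TTS backend.
def fake_stream():
    time.sleep(0.01)            # pretend ~10 ms of synthesis work
    yield b"\x00" * 320         # first audio packet
    yield b"\x00" * 320         # subsequent packets stream afterwards

latency_ms, chunk = first_packet_latency(fake_stream())
print(latency_ms >= 10.0)  # True
```

Swapping `fake_stream()` for a real client stream would reproduce the kind of 97–101 ms measurement reported above.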

Alignment and Control Mechanisms

Post-Training Enhancements

Qwen3-TTS uses a progressive post-training alignment process. First, Direct Preference Optimization (DPO) aligns the generated speech with human preferences derived from multilingual datasets. This is followed by GSPO (Group Sequence Policy Optimization), which refines stability and prosody, and finally a speaker fine-tuning stage that preserves the model’s core capabilities. This staged approach ensures that the generated voices not only sound natural but also resonate well with diverse audiences.

Instruction-Following and Customization

Instruction following in Qwen3-TTS is implemented through a ChatML format: users provide text instructions about style, emotion, or pace before the input text. This interface not only powers VoiceDesign but also supports fine-tuned edits for cloned speakers. Such customization empowers users to tailor the output to specific contexts, whether a playful tone in a children’s app or a professional tone in corporate training materials.
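A ChatML-style prompt with a style instruction placed before the input text might look like the sketch below. The role names and token delimiters shown are the common ChatML convention used by Qwen chat models; whether Qwen3-TTS expects exactly this layout is an assumption, so check the model card for the actual format.

```python
# Hedged sketch of a ChatML-style prompt. The role names and <|im_start|> /
# <|im_end|> delimiters follow common ChatML convention; the exact format
# Qwen3-TTS expects may differ.
def build_chatml(instruction: str, text: str) -> str:
    return (
        "<|im_start|>system\n" + instruction + "<|im_end|>\n"
        "<|im_start|>user\n" + text + "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_chatml(
    "Speak slowly, in a warm and professional tone.",
    "Welcome to today's training session.",
)
print(prompt.startswith("<|im_start|>system"))  # True
```

The key point the section makes is the ordering: the style/emotion/pace instruction precedes the text to be spoken, which is what the `system`-before-`user` layout captures.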

Performance Benchmarks and Multilingual Capabilities

Evaluations and Comparisons

In evaluations conducted on the Seed-TTS test set, the Qwen3-TTS-12Hz-1.7B-Base model achieved a remarkable Word Error Rate (WER) of 0.77 in Chinese and 1.24 in English. Notably, the 1.24 WER in English places it among the top-performing systems in the field, while its Chinese performance is competitive with the best available solutions. These benchmarks highlight the model’s capability to deliver accurate and high-quality speech generation, making it a strong contender in the TTS scene.

On a multilingual TTS test set, Qwen3-TTS achieved the lowest WER in six out of ten languages, showing its versatility across linguistic contexts. It also demonstrated superior speaker similarity compared to systems like MiniMax-Speech and ElevenLabs Multilingual v2. This indicates that Qwen3-TTS performs well not only in accuracy but also in preserving the unique characteristics of different speakers, an important factor for applications in media and entertainment.
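For readers unfamiliar with the metric used in these benchmarks, Word Error Rate is the word-level edit distance between a reference transcript and the ASR transcript of the generated speech, divided by the reference length. A standard implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))            # 0.0
print(round(wer("the cat sat", "the bat sat"), 2))  # 0.33
```

The WER figures quoted above (0.77 and 1.24) are percentages, so a 1.24 WER in English means roughly one word error per eighty reference words.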

Conclusion: The Future of TTS Technology

Key Takeaways

Qwen3-TTS is a groundbreaking open-source multilingual TTS suite offering:

  • A full-stack solution for high-quality TTS, voice cloning, and voice design across ten languages.
  • An efficient discrete codec and real-time streaming capabilities, providing low latency and high audio quality.
  • Task-specific variants that enhance voice cloning and customization, ensuring flexible application possibilities.
  • A multi-stage alignment pipeline leading to low error rates and high speaker similarity across languages.

For anyone interested in exploring the power of TTS technology, Qwen3-TTS is a significant step forward. Check out the repository, model weights, and playground to dive deeper into its functionalities. You can also connect with the team on Twitter, join the ML subreddit community, or subscribe to their newsletter for the latest updates! As TTS technology continues to evolve, innovations like Qwen3-TTS will undoubtedly play a major role in shaping the future of human-computer interaction and communication.

