
# Qwen3-TTS

Qwen3-TTS is an open-source [[Text-to-Speech (TTS)]] model series developed by the Qwen team at Alibaba Cloud. It was released in January 2026 under the Apache 2.0 license and trained on more than 5 million hours of speech data across 10 languages. It uses a Dual-Track hybrid streaming architecture with a discrete multi-codebook LM, so a single model supports both streaming and non-streaming generation. End-to-end synthesis latency can be as low as 97 ms.

## Model variants

- **Qwen3-TTS-12Hz-1.7B**: Flagship model, best quality and control
- **Qwen3-TTS-12Hz-0.6B**: Lightweight version balancing efficiency with quality
- **Qwen3-TTS CustomVoice**: Premium pre-built speakers with style instructions
- **Qwen3-TTS VoiceDesign (1.7B)**: Creates voices from natural language descriptions, no reference audio needed

## Supported languages

Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian. Multiple dialectal voice profiles are also included.

## Key features

- [[Voice Cloning]] from 3 seconds of reference audio (0.95 similarity score)
- Voice design from natural language descriptions
- 40+ emotion presets
- Streaming and non-streaming generation
- 97 ms end-to-end latency
- 9 pre-built premium speakers (Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee)

## References

- Announcement: https://qwen.ai/blog?id=qwen3tts-0115
- Source code: https://github.com/QwenLM/Qwen3-TTS
- Demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS

## Related

- [[Text-to-Speech (TTS)]]
- [[Voice Cloning]]
- [[Voice Clone Studio]]
- [[VibeVoice]]
- [[Qwen]]