VibeVoice - DeveloPassion

# VibeVoice

VibeVoice is a family of open-source frontier voice AI models by Microsoft, covering both [[Text-to-Speech (TTS)]] and automatic speech recognition (ASR). It is licensed under MIT.

A core innovation is the use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz, which preserves audio fidelity while keeping long sequences computationally tractable.

## Model variants

- **VibeVoice-TTS (1.5B / Large / Large-4bit)**: Long-form multi-speaker TTS that synthesizes up to 90 minutes of speech with up to 4 distinct speakers. Ideal for podcasts, audiobooks, and narratives
- **VibeVoice-Realtime-0.5B**: Real-time TTS with streaming text input and robust long-form generation
- **VibeVoice-ASR**: Unified speech-to-text model that handles 60 minutes of long-form audio in a single pass, generating structured transcriptions with speaker identification (Who), timestamps (When), and content (What)

## Key features

- Zero-shot [[Voice Cloning]] from short reference audio
- Up to 90 minutes of continuous multi-speaker speech
- Cross-lingual support
- Ultra-low frame rate tokenization (7.5 Hz) for efficiency
- May spontaneously add background sounds for realism

## Note

Microsoft removed the VibeVoice-TTS code from the official repository due to responsible use concerns. The models remain available on Hugging Face and through community forks.

## References

- Project page: https://microsoft.github.io/VibeVoice/
- Source code: https://github.com/microsoft/VibeVoice
- Hugging Face (1.5B): https://huggingface.co/microsoft/VibeVoice-1.5B
- Hugging Face (Realtime): https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B
- ASR documentation: https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md

## Related

- [[Text-to-Speech (TTS)]]
- [[Voice Cloning]]
- [[Voice Clone Studio]]
- [[Qwen3-TTS]]
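
## Worked example: frame counts at 7.5 Hz

To make the efficiency claim concrete, the sketch below counts how many tokenizer frames are needed for the durations mentioned in this note. It is back-of-the-envelope arithmetic only, not part of VibeVoice: the helper `frames_for` is a hypothetical name, the 50 Hz comparison rate is an illustrative assumption rather than a figure from the project, and the real sequence length seen by the model also depends on text and other tokens that are not counted here.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in this note


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames covering `minutes` of audio."""
    seconds = minutes * 60
    return round(seconds * frame_rate_hz)


# Durations taken from the note: 90 minutes (TTS) and 60 minutes (ASR).
tts_frames = frames_for(90)
asr_frames = frames_for(60)

# Assumed comparison point: a hypothetical tokenizer running at 50 Hz.
baseline_frames = frames_for(90, frame_rate_hz=50.0)

print(f"90 min at 7.5 Hz: {tts_frames} frames")
print(f"60 min at 7.5 Hz: {asr_frames} frames")
print(f"90 min at 50 Hz (assumed baseline): {baseline_frames} frames")
```

Under these assumptions, 90 minutes of audio corresponds to 40,500 frames at 7.5 Hz, compared with 270,000 frames at the assumed 50 Hz rate, which is why a low frame rate matters for long-form generation.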