# VibeVoice
VibeVoice is a family of open-source frontier voice AI models from Microsoft, covering both [[Text-to-Speech (TTS)]] and automatic speech recognition (ASR). The models are released under the MIT license.
A core innovation is the use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz, which preserves audio fidelity while reducing the computational cost of long sequences.
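
To make the effect of the 7.5 Hz frame rate concrete, the following sketch counts how many tokenizer frames cover a given duration. Only the 7.5 Hz figure comes from this note; the 50 Hz comparison rate is an arbitrary assumption chosen for illustration, and the function name `frames_for` is hypothetical rather than part of any VibeVoice API.

```python
# Illustrative arithmetic only; not part of the VibeVoice codebase.

TOKENIZER_FRAME_RATE_HZ = 7.5    # frame rate stated in this note
COMPARISON_FRAME_RATE_HZ = 50.0  # assumption: a hypothetical higher-rate codec


def frames_for(duration_seconds: float,
               frame_rate_hz: float = TOKENIZER_FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames covering the given duration."""
    return round(duration_seconds * frame_rate_hz)


if __name__ == "__main__":
    for minutes in (1, 60, 90):
        seconds = minutes * 60
        low = frames_for(seconds)
        high = frames_for(seconds, COMPARISON_FRAME_RATE_HZ)
        print(f"{minutes:>3} min: {low:>6} frames at 7.5 Hz, {high:>7} frames at 50 Hz")
```

Under these assumptions, a 90-minute session is 5,400 seconds, or 40,500 frames at 7.5 Hz, which is what keeps very long sequences tractable.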
## Model variants
- **VibeVoice-TTS (1.5B / Large / Large-4bit)**: Long-form multi-speaker TTS that synthesizes up to 90 minutes of speech with up to 4 distinct speakers; suited to podcasts, audiobooks, and narratives
- **VibeVoice-Realtime-0.5B**: Real-time TTS with streaming text input and robust long-form generation
- **VibeVoice-ASR**: Unified speech-to-text model handling 60-minute long-form audio in a single pass, generating structured transcriptions with speaker identification (Who), timestamps (When), and content (What)
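
The Who / When / What structure of an ASR transcription can be pictured as a list of segments. The sketch below is a hypothetical data model, not the actual VibeVoice-ASR output format: the class `TranscriptSegment`, its field names, and the rendered layout are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TranscriptSegment:
    """Hypothetical container for one transcribed segment."""
    speaker: str    # Who
    start_s: float  # When: segment start, in seconds
    end_s: float    # When: segment end, in seconds
    text: str       # What


def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS (fractions truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: Iterable[TranscriptSegment]) -> str:
    """Render segments in chronological order, one per line."""
    ordered = sorted(segments, key=lambda seg: seg.start_s)
    return "\n".join(
        f"[{format_timestamp(seg.start_s)} - {format_timestamp(seg.end_s)}] "
        f"{seg.speaker}: {seg.text}"
        for seg in ordered
    )


if __name__ == "__main__":
    demo = [
        TranscriptSegment("Speaker 2", 4.2, 7.9, "Thanks for having me."),
        TranscriptSegment("Speaker 1", 0.0, 4.2, "Welcome to the show."),
    ]
    print(render(demo))
```

Keeping speaker, time span, and text together in one record makes it straightforward to sort, filter by speaker, or export to subtitle-style formats.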
## Key features
- Zero-shot [[Voice Cloning]] from short reference audio
- Up to 90 minutes of continuous multi-speaker speech
- Cross-lingual support
- Ultra-low frame rate tokenization (7.5 Hz) for efficiency
- May spontaneously add background sounds for realism
## Note
Microsoft removed the VibeVoice-TTS code from the official repository over responsible-use concerns. The models remain available on HuggingFace and through community forks.
## References
- Project page: https://microsoft.github.io/VibeVoice/
- Source code: https://github.com/microsoft/VibeVoice
- HuggingFace (1.5B): https://huggingface.co/microsoft/VibeVoice-1.5B
- HuggingFace (Realtime): https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B
- ASR documentation: https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md
## Related
- [[Text-to-Speech (TTS)]]
- [[Voice Cloning]]
- [[Voice Clone Studio]]
- [[Qwen3-TTS]]