# YuE
An open-source foundation model for full-song music generation. "YuE" means "music" and "happiness" in Chinese. Primary task: lyrics-to-song; given lyrics and genre tags, produces complete multi-minute songs (up to ~5 min) with separate vocal and accompaniment tracks. Built on LLaMA2. Apache 2.0 license. Created by HKUST and M-A-P (Multimodal Art Projection).
## Architecture
Two-stage pipeline:
| Component | Parameters | Role |
|-----------|-----------|------|
| Stage 1 (s1) | 7B | Language model generating semantic music tokens via track-decoupled next-token prediction |
| Stage 2 (s2) | 1B | Codec model converting semantic tokens to audio waveforms |
| Upsampler | -- | Optional audio quality enhancement |
Key innovations: track-decoupled next-token prediction (handles vocal + instrumental mix), structural progressive conditioning (long-context lyrical alignment), multitask/multiphase pre-training (trillions of tokens).
## Inference Modes
- **CoT (Chain-of-Thought)**: reasoning-based generation
- **ICL (In-Context Learning)**: reference audio-guided; enables style transfer and voice cloning (dual-track or single-track)
## Supported Features
- **Languages**: English, Mandarin Chinese, Cantonese, Japanese, Korean; code-switching supported
- **Genres**: Metal, Jazz, Rap, Pop, Ballad, Soul, Country, Alternative Rock, Indie, Children's, Folk, Rock, K-pop, Mandarin Pop, and more (top 200 tags)
- **Genre tag structure**: genre, instrument, mood, gender, timbre
- **Vocal techniques**: scatting, death growl, mix voice, belting, a cappella, Beijing Opera, traditional Chinese folk singing
- **Instruments**: piano, guitar, drums, synthesizer, bass, violin, keyboard, electronic, harmonica
- **Structural labels**: `[verse]`, `[chorus]`, `[bridge]`, `[outro]`
- **Additional**: voice cloning, style transfer, song continuation, LoRA fine-tuning
## Hardware Requirements
- 24GB GPU: up to 2 sessions (~30s segments)
- 80GB+ GPU (H800/A100): full songs (4+ sessions)
- FlashAttention 2 mandatory for long-form generation
- Speed: ~150s per 30s audio on H800; ~360s on RTX 4090
## References
- Website: https://map-yue.github.io/
- Source code: https://github.com/multimodal-art-projection/YuE
- Paper: arXiv:2503.08638
## Related
- [[Large Language Models (LLMs)]]