# YuE An open-source foundation model for full-song music generation. "YuE" means "music" and "happiness" in Chinese. Primary task: lyrics-to-song; given lyrics and genre tags, produces complete multi-minute songs (up to ~5 min) with separate vocal and accompaniment tracks. Built on LLaMA2. Apache 2.0 license. Created by HKUST and M-A-P (Multimodal Art Projection). ## Architecture Two-stage pipeline: | Component | Parameters | Role | |-----------|-----------|------| | Stage 1 (s1) | 7B | Language model generating semantic music tokens via track-decoupled next-token prediction | | Stage 2 (s2) | 1B | Codec model converting semantic tokens to audio waveforms | | Upsampler | -- | Optional audio quality enhancement | Key innovations: track-decoupled next-token prediction (handles vocal + instrumental mix), structural progressive conditioning (long-context lyrical alignment), multitask/multiphase pre-training (trillions of tokens). ## Inference Modes - **CoT (Chain-of-Thought)**: reasoning-based generation - **ICL (In-Context Learning)**: reference audio-guided; enables style transfer and voice cloning (dual-track or single-track) ## Supported Features - **Languages**: English, Mandarin Chinese, Cantonese, Japanese, Korean; code-switching supported - **Genres**: Metal, Jazz, Rap, Pop, Ballad, Soul, Country, Alternative Rock, Indie, Children's, Folk, Rock, K-pop, Mandarin Pop, and more (top 200 tags) - **Genre tag structure**: genre, instrument, mood, gender, timbre - **Vocal techniques**: scatting, death growl, mix voice, belting, a cappella, Beijing Opera, traditional Chinese folk singing - **Instruments**: piano, guitar, drums, synthesizer, bass, violin, keyboard, electronic, harmonica - **Structural labels**: `[verse]`, `[chorus]`, `[bridge]`, `[outro]` - **Additional**: voice cloning, style transfer, song continuation, LoRA fine-tuning ## Hardware Requirements - 24GB GPU: up to 2 sessions (~30s segments) - 80GB+ GPU (H800/A100): full songs (4+ sessions) - FlashAttention 2 mandatory for long-form generation - Speed: ~150s per 30s audio on H800; ~360s on RTX 4090 ## References - Website: https://map-yue.github.io/ - Source code: https://github.com/multimodal-art-projection/YuE - Paper: arXiv:2503.08638 ## Related - [[Large Language Models (LLMs)]]