# vLLM

vLLM is a high-throughput, memory-efficient open-source library for inference and serving of [[Large Language Models (LLMs)]]. Originally created at UC Berkeley's Sky Computing Lab, it has evolved into a community-driven project. It is written primarily in [[Python]] (with CUDA and C++ components) and is licensed under the [[Apache 2.0 License]].

## Key Features

- **PagedAttention**: core innovation for efficient management of attention key-value memory, inspired by virtual memory paging in operating systems
- **Continuous batching**: dynamically batches incoming requests for higher throughput
- **Quantization support**: GPTQ, AWQ, AutoRound, INT4, INT8, FP8
- **Distributed inference**: tensor, pipeline, data, and expert parallelism
- **Hardware support**: NVIDIA/AMD GPUs, Intel/PowerPC/Arm CPUs, TPUs, Gaudi, Ascend
- **Multi-LoRA**: serve multiple fine-tuned adapters from a single base model
- **Prefix caching**: reuse computed KV cache across requests sharing common prefixes
- **Speculative decoding**: faster generation using draft models
- **OpenAI-compatible API server**: drop-in replacement for [[OpenAI]] API endpoints

## Supported Model Types

- Transformer LLMs (Llama, etc.)
- Mixture-of-Experts (Mixtral, DeepSeek)
- Embedding models
- Multi-modal architectures

## References

- https://github.com/vllm-project/vllm

## Related

- [[Large Language Models (LLMs)]]
- [[Python]]
- [[Apache 2.0 License]]
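
## Example: Paged KV Memory (Toy Sketch)

The paging idea behind PagedAttention, as described under Key Features, can be illustrated with a small standalone model. This is only a sketch of the general technique and is not vLLM's implementation or API. The names `ToyBlockAllocator`, `ToySequence`, and `BLOCK_SIZE` are hypothetical, and the sketch tracks block IDs only, not real key-value tensors.

```python
from dataclasses import dataclass, field

# Tokens stored per KV block. Hypothetical value, kept small for illustration.
BLOCK_SIZE = 4


@dataclass
class ToyBlockAllocator:
    """Hands out fixed-size physical blocks from a free list (illustrative only)."""
    num_blocks: int
    free: list = field(init=False)

    def __post_init__(self):
        # All physical blocks start out free.
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class ToySequence:
    """Tracks one request's mapping from logical blocks to physical blocks."""
    allocator: ToyBlockAllocator
    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self) -> None:
        # A new physical block is requested only when the last one is full,
        # so memory grows on demand instead of being reserved up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free_all(self) -> None:
        # Return every block to the pool once the request finishes.
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


if __name__ == "__main__":
    allocator = ToyBlockAllocator(num_blocks=8)
    seq = ToySequence(allocator)
    for _ in range(9):
        seq.append_token()
    print(len(seq.block_table), len(allocator.free))
```

In this model a sequence wastes at most one partially filled block, and blocks released by a finished request become available to other requests immediately, which is the property that virtual-memory-style paging is meant to provide.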