# vLLM
vLLM is a high-throughput, memory-efficient open-source library for [[Large Language Models (LLMs)]] inference and serving. Originally created at UC Berkeley's Sky Computing Lab, it has evolved into a community-driven project. It is written primarily in [[Python]] (with CUDA and C++ components) and is licensed under the [[Apache 2.0 License]].
## Key Features
- **PagedAttention**: core innovation for efficient management of attention key-value memory, inspired by virtual memory paging in operating systems
- **Continuous batching**: dynamically batches incoming requests for higher throughput
- **Quantization support**: GPTQ, AWQ, AutoRound, INT4, INT8, FP8
- **Distributed inference**: tensor, pipeline, data, and expert parallelism
- **Hardware support**: NVIDIA/AMD GPUs, Intel/PowerPC/Arm CPUs, TPUs, Gaudi, Ascend
- **Multi-LoRA**: serve multiple fine-tuned adapters from a single base model
- **Prefix caching**: reuse computed KV cache across requests sharing common prefixes
- **Speculative decoding**: faster generation using draft models
- **OpenAI-compatible API server**: drop-in replacement for [[OpenAI]] API endpoints
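
The paging idea behind PagedAttention can be illustrated with a toy block allocator. This is a minimal sketch of the concept, not vLLM's actual implementation: the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are hypothetical, and real KV blocks hold tensors on the accelerator rather than bare integer ids. Each sequence keeps a block table mapping its logical positions to fixed-size physical blocks, so memory is claimed one block at a time instead of being reserved up front for the maximum sequence length.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (hypothetical value, kept small for illustration)


class BlockAllocator:
    """Hands out fixed-size physical blocks from a free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_ids: list[int]) -> None:
        # Return a finished sequence's blocks to the pool for reuse.
        self.free.extend(block_ids)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=8)
a, b = Sequence(), Sequence()
for _ in range(6):  # 6 tokens occupy 2 blocks
    a.append_token(allocator)
for _ in range(3):  # 3 tokens occupy 1 block
    b.append_token(allocator)
```

Because blocks need not be contiguous, the blocks of a finished sequence can be returned to the pool with `release` and handed to any other request, which is what keeps fragmentation low in this scheme.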
## Supported Model Types
- Transformer LLMs (Llama, etc.)
- Mixture-of-Experts (Mixtral, DeepSeek)
- Embedding models
- Multi-modal architectures
## References
- https://github.com/vllm-project/vllm
## Related
- [[Large Language Models (LLMs)]]
- [[Python]]
- [[Apache 2.0 License]]