Neural Processing Unit (NPU)

# Neural Processing Unit (NPU) A specialized processor designed to accelerate machine learning workloads — particularly the matrix multiplications and activations at the heart of [[Neural Networks (NNs)]]. NPUs sit alongside CPUs and GPUs in modern devices, providing dramatically better performance-per-watt for ML tasks. Also called: AI accelerator, ML accelerator, AI Engine, Neural Engine. ## Why NPUs Exist GPUs accelerate ML well but were designed for graphics. CPUs are general-purpose. NPUs are purpose-built for ML: - Optimized for low-precision math (INT8, INT4, BF16) - Massively parallel matrix operations - Often 5-10× more efficient (perf/watt) than GPUs for ML workloads - Critical for sustained ML on battery-powered devices ## Where They Show Up | Vendor | NPU | |---|---| | Apple | Neural Engine (every iPhone since A11, every M-series Mac) | | Qualcomm | Hexagon NPU (Snapdragon mobile + Copilot+ PCs) | | Intel | NPU (Core Ultra "Meteor Lake" and newer) | | AMD | XDNA / Ryzen AI | | Google | Tensor TPU (Pixel phones) | | Microsoft | Co-designed with Qualcomm/Intel for Copilot+ PCs | | Huawei | Da Vinci NPU (Kirin chips) | ## Microsoft's Copilot+ PC Threshold Microsoft set a 40 TOPS minimum for "Copilot+" certification — a marker for laptops capable of running on-device AI features fluidly. This drove NPU adoption across the PC industry in 2024-2025. ## Why NPUs Matter for Browser ML [[WebNN API]] is designed to expose NPU acceleration to web apps. When a model runs through WebNN with an NPU execution provider, inference can be: - 5-10× faster than CPU - 5-10× more power-efficient than GPU - Free up GPU for graphics This is what makes [[Browser-Provided Language Models]] like [[Gemini Nano]] viable on consumer hardware. ## Programming Model Developers rarely program NPUs directly. Access is through: - OS-level frameworks (Apple Core ML, Windows DirectML, Android NNAPI) - Runtimes ([[ONNX Runtime Web]], TensorFlow Lite, PyTorch Mobile) - Browser APIs ([[WebNN API]]) The runtime translates the high-level model graph into NPU-specific kernels. ## Trade-offs **Strengths:** efficiency, latency, battery life, privacy-by-locality **Weaknesses:** programmability gap (vendor-specific), supports a subset of ops, often quantized-only ## References - https://en.wikipedia.org/wiki/AI_accelerator ## Related - [[Neural Networks (NNs)]] - [[Machine Learning (ML)]] - [[AI Inference]] - [[WebNN API]] - [[On-Device Machine Learning]] - [[Edge AI]] - [[Browser-Provided Language Models]] - [[AI Quantization]] - [[Apple Neural Engine]] - [[Apple Intelligence]] - [[Apple Core ML]] - [[Windows Copilot Runtime]]