# Whisper

Whisper is an open-source automatic speech recognition (ASR) system by [[OpenAI]]. Licensed under MIT, with 94k+ GitHub stars.

Trained on 680,000 hours of multilingual and multitask supervised data collected from the web.

The architecture is an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and passed into the encoder. The decoder then predicts the text. It is a multitask model that can perform speech recognition, speech translation into English, and language identification.

## Key features

- Supports 99 languages (about 1/3 of the training audio is non-English)
- Robust to accents, background noise, and technical language (about 50% fewer errors than specialized models in zero-shot evaluation across diverse datasets, per OpenAI's announcement)
- Multitask: transcription, translation to English, language identification
- Available locally (open-source) or via the OpenAI API (`whisper-1`)

## Model sizes

| Model | Parameters | English-only variant (`.en`) | Multilingual |
|-------|-----------|:---:|:---:|
| tiny | 39M | Yes | Yes |
| base | 74M | Yes | Yes |
| small | 244M | Yes | Yes |
| medium | 769M | Yes | Yes |
| large-v3 | 1.55B | No | Yes |

## Usage

Can be run locally via Python, called through the OpenAI API, or used via third-party integrations. The OpenAI API's speech-to-text endpoint also supports diarization (speaker labels) with newer transcription models.

## References

- Announcement: https://openai.com/index/whisper/
- Source code: https://github.com/openai/whisper
- Paper: https://cdn.openai.com/papers/whisper.pdf
- HuggingFace (large-v3): https://huggingface.co/openai/whisper-large-v3
- API docs: https://platform.openai.com/docs/guides/speech-to-text

## Related

- [[OpenAI]]
- [[Speech-to-Text (STT)]]
- [[Text-to-Speech (TTS)]]
- [[Voice Clone Studio]]
- [[Parakeet V3]]
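
## Example: 30-second chunking

The 30-second windowing mentioned in the architecture description can be illustrated with a minimal sketch. It is not the library's own code: the helper `split_into_chunks` is hypothetical, the 16 kHz mono sample rate and the zero-padding of the last chunk are assumptions made for illustration, and the real pipeline handles windowing internally and may differ in detail.

```python
from typing import List

SAMPLE_RATE = 16_000   # assumption: 16 kHz mono input
CHUNK_SECONDS = 30     # window length described in the note
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per chunk


def split_into_chunks(samples: List[float]) -> List[List[float]]:
    """Hypothetical helper: split audio into fixed 30-second chunks.

    The final chunk is zero-padded (silence) so every chunk has the
    same length before it would be turned into a spectrogram.
    """
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        # Pad a short final chunk with silence up to the full window.
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks


# 70 seconds of placeholder audio: 30 s + 30 s + 10 s (padded to 30 s)
audio = [0.1] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(audio)
print(len(chunks), len(chunks[-1]))
```

Each fixed-length chunk would then be converted into a log-Mel spectrogram before reaching the encoder. That step is omitted here.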