# Whisper
Whisper is an open-source automatic speech recognition (ASR) system by [[OpenAI]]. It is released under the MIT license and has 94k+ GitHub stars. The model was trained on 680,000 hours of multilingual and multitask supervised data collected from the web.
The architecture is an encoder-decoder Transformer. Input audio is split into 30-second chunks, each chunk is converted into a log-Mel spectrogram, and the spectrogram is passed into the encoder. The decoder then predicts the text, with special tokens selecting the task, so a single model can perform speech recognition, speech translation, and language identification.
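The chunking step can be illustrated with a little arithmetic. The sketch below is not Whisper's own code: `split_into_chunks` is a hypothetical helper, and the 16 kHz sample rate and 160-sample hop are assumptions taken from the reference implementation rather than from this note. It also uses fixed, non-overlapping windows, whereas the reference `transcribe()` routine shifts its window based on predicted timestamps.

```python
# Assumed constants (from the reference implementation, not stated in this note).
SAMPLE_RATE = 16_000   # Hz; audio is resampled to this rate before processing
CHUNK_SECONDS = 30     # window length described above
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms at 16 kHz)

N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # samples per 30-second chunk
N_FRAMES = N_SAMPLES // HOP_LENGTH       # spectrogram frames per chunk


def split_into_chunks(samples):
    """Split a sequence of float samples into 30-second windows.

    The final window is zero-padded so every chunk has exactly N_SAMPLES
    samples, which is the fixed input length the encoder expects.
    """
    chunks = []
    for start in range(0, len(samples), N_SAMPLES):
        chunk = list(samples[start:start + N_SAMPLES])
        chunk.extend([0.0] * (N_SAMPLES - len(chunk)))  # pad the last chunk
        chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    audio = [0.0] * (SAMPLE_RATE * 45)  # 45 seconds of silence
    chunks = split_into_chunks(audio)
    print(len(chunks), "chunks of", N_SAMPLES, "samples;", N_FRAMES, "frames each")
```

Under these assumptions, each chunk holds 480,000 samples and yields 3,000 spectrogram frames, so a 45-second clip becomes two chunks, the second of which is half padding.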
## Key features
- Supports 99 languages (about 1/3 of the training audio is non-English)
- Robust to accents, background noise, and technical language (per OpenAI, about 50% fewer errors than LibriSpeech-specialized models in zero-shot evaluation across diverse datasets)
- Multitask: transcription, translation to English, language identification
- Available locally (open-source) or via the OpenAI API (`whisper-1`)
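As a rough illustration of the hosted option, the sketch below calls the `whisper-1` model through the official `openai` Python package. The function name `transcribe_via_api` and its `translate` flag are hypothetical; the sketch assumes `pip install openai` and an `OPENAI_API_KEY` environment variable, and it only covers transcription and translation to English.

```python
from pathlib import Path


def transcribe_via_api(audio_path, translate=False):
    """Send an audio file to the OpenAI API and return the recognized text.

    With translate=True the speech is translated into English instead of
    being transcribed in its original language.
    """
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"audio file not found: {path}")

    # Imported lazily so the helper can be defined without the package installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with path.open("rb") as audio_file:
        if translate:
            result = client.audio.translations.create(
                model="whisper-1", file=audio_file
            )
        else:
            result = client.audio.transcriptions.create(
                model="whisper-1", file=audio_file
            )
    return result.text
```

A call such as `transcribe_via_api("meeting.mp3")` (file name made up for illustration) would return the transcript as a plain string.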
## Model sizes
| Model | Parameters | English-only variant | Multilingual variant |
|-------|-----------|:---:|:---:|
| tiny | 39M | Yes | Yes |
| base | 74M | Yes | Yes |
| small | 244M | Yes | Yes |
| medium | 769M | Yes | Yes |
| large-v3 | 1.55B | No | Yes |
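The table can be turned into a small lookup for picking a checkpoint name. This is a sketch only: `checkpoint_name` is a hypothetical helper, and the `.en` suffix for English-only checkpoints (for example `base.en`) is the naming convention of the open-source package, which this note does not spell out.

```python
# Sizes from the table above; True means an English-only checkpoint exists.
HAS_ENGLISH_ONLY = {
    "tiny": True,
    "base": True,
    "small": True,
    "medium": True,
    "large-v3": False,
}


def checkpoint_name(size, english_only=False):
    """Return the checkpoint name for a model size.

    Falls back to the multilingual checkpoint when no English-only
    variant exists for that size (large-v3 in the table above).
    """
    if size not in HAS_ENGLISH_ONLY:
        raise ValueError(f"unknown model size: {size!r}")
    if english_only and HAS_ENGLISH_ONLY[size]:
        return f"{size}.en"
    return size


if __name__ == "__main__":
    print(checkpoint_name("base", english_only=True))
    print(checkpoint_name("large-v3", english_only=True))
```

Falling back silently to the multilingual checkpoint is a design choice made for this sketch; raising an error instead would be equally reasonable.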
## Usage
Whisper can be run locally via Python, called through the OpenAI API, or used via third-party integrations. The API's speech-to-text endpoint also supports diarization (speaker labels) when used with newer transcription models.
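For the local route, a minimal sketch using the open-source `whisper` Python package might look like the following. The wrapper `transcribe_locally` is hypothetical; it assumes `pip install -U openai-whisper` and an `ffmpeg` binary on the system path, and the chosen checkpoint is downloaded on first use.

```python
from pathlib import Path


def transcribe_locally(audio_path, model_name="base", task="transcribe"):
    """Run Whisper on a local audio file.

    task="transcribe" keeps the spoken language; task="translate"
    produces English text. Returns the detected language and the text.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"audio file not found: {path}")

    # Imported lazily so the helper can be defined without the package installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(str(path), task=task)
    return {"language": result["language"], "text": result["text"].strip()}
```

Smaller checkpoints such as `tiny` or `base` trade accuracy for speed, so they are a sensible starting point on a machine without a GPU.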
## References
- Announcement: https://openai.com/index/whisper/
- Source code: https://github.com/openai/whisper
- Paper: https://cdn.openai.com/papers/whisper.pdf
- HuggingFace (large-v3): https://huggingface.co/openai/whisper-large-v3
- API docs: https://platform.openai.com/docs/guides/speech-to-text
## Related
- [[OpenAI]]
- [[Speech-to-Text (STT)]]
- [[Text-to-Speech (TTS)]]
- [[Voice Clone Studio]]
- [[Parakeet V3]]