Speech-to-Text (STT) - DeveloPassion

# Speech-to-Text (STT)

Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR), is a technology that converts spoken audio into written text. Modern STT systems use deep learning to achieve high accuracy across languages, accents, and noisy environments.

## How it works

A typical neural STT pipeline:

1. **Audio preprocessing**: Raw audio is converted into a spectral representation (e.g., log-Mel spectrogram)
2. **Acoustic model**: A neural network (Transformer, Conformer, or RNN-Transducer) maps audio features to linguistic units
3. **Language model / decoder**: Converts acoustic predictions into coherent text, applying language-level constraints
4. **Post-processing**: Punctuation, capitalization, formatting, and optionally speaker diarization and timestamps

## Approaches

- **Encoder-decoder (seq2seq)**: Used by [[Whisper]]. Processes fixed-length (30-second) audio chunks through an encoder, then decodes text autoregressively. Accurate but not natively streaming
- **RNN-Transducer / TDT**: Used by [[Parakeet V3]]. Enables streaming recognition with low latency, well-suited for real-time applications
- **CTC (Connectionist Temporal Classification)**: Simpler alignment-free approach, often combined with external language models

## Key capabilities in modern models

- Multilingual recognition (99 languages in Whisper, 25 in Parakeet V3)
- Speaker diarization (identifying who said what)
- Timestamps at word or segment level
- Robustness to background noise and accents
- Long-form audio (up to 3 hours in Parakeet V3, 60 minutes in VibeVoice-ASR)

## Notable open-source models (2026)

- [[Whisper]] (OpenAI): 94k+ GitHub stars, 99 languages, encoder-decoder Transformer
- [[Parakeet V3]] (NVIDIA): 2000x+ real-time speed, 25 languages, RNN-Transducer
- VibeVoice-ASR ([[VibeVoice]], Microsoft): 60-minute long-form with structured output (speaker, timestamps, content)
- Canary (NVIDIA): Multilingual, part of NeMo framework

## Applications

- Voice assistants and dictation
- Meeting transcription and note-taking
- Subtitles and closed captioning
- Call center analytics
- Accessibility
- Podcast and video transcription

## References

- Wikipedia: https://en.wikipedia.org/wiki/Speech_recognition

## Related

- [[Text-to-Speech (TTS)]]
- [[Whisper]]
- [[Parakeet V3]]
- [[VibeVoice]]
- [[Voice Clone Studio]]
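
## Example: CTC greedy decoding (sketch)

The CTC approach listed under Approaches emits one label per audio frame, including a special "blank" symbol, and the text is recovered by collapsing consecutive repeated labels and then dropping the blanks. The snippet below is a minimal sketch of that greedy decoding step, assuming the per-frame labels have already been chosen (for example by taking the most likely label at each frame). The `_` blank symbol, the function name `ctc_greedy_decode`, and the sample frame sequence are illustrative assumptions, not taken from any particular library or model.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real models use a reserved token id


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse consecutive repeated labels, then remove blanks."""
    # groupby merges runs of identical adjacent labels into one entry each
    collapsed = [label for label, _ in groupby(frame_labels)]
    # Blanks only serve as separators (e.g. between the two "l" in "hello")
    return "".join(label for label in collapsed if label != blank)


# One label per audio frame; the blank between "ll" and "ll" keeps both letters
frames = list("hh_e_ll_lloo__")
print(ctc_greedy_decode(frames))  # prints "hello"
```

Greedy decoding only looks at the single best label per frame. As noted above, CTC models are often combined with an external language model, which typically means replacing this step with a beam search that rescores candidate transcripts.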