# Gemini 3.1 Flash TTS
Google's latest [[Text-to-Speech (TTS)]] model, released in April 2026 as part of the [[Gemini]] family. It focuses on controllability, expressivity, and quality for developers and enterprises building speech applications, and complements [[Gemini 3.1 Flash Live]] (real-time dialogue) with higher-quality, more controllable offline generation.
## Key Capabilities
- **Audio tags for style control**: natural-language inline tags (e.g. `[whispers]`, `[laughs]`, `[excited]`, `[sighs]`, `[sarcastic]`, `[very fast]`) give granular control over delivery, tone, pace, and non-verbal sounds. Tags can be combined and placed mid-sentence. There is no exhaustive tag list; experimentation is encouraged.
- **Creative expressivity**: supports stylistic directives such as `[like a cartoon dog]` or `[like dracula]`, as well as scene direction and speaker-level audio profiles.
- **Multi-speaker dialogue**: native support for multi-speaker conversations with distinct voices.
- **Broad language coverage**: 70+ languages. For non-English transcripts, writing the tags in English is recommended.
- **Quality**: 1,211 Elo on the Artificial Analysis TTS leaderboard, which places it in the "attractive" (high-quality, low-cost) quadrant of the quality-vs-cost comparison.
- **SynthID watermarking**: all generated audio carries an imperceptible watermark that flags it as AI-generated.
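
The tag and multi-speaker features above can be illustrated with a small transcript builder. This is a minimal sketch, not the official API: the `Turn` class and the `render_*` helpers are hypothetical names, and the `Speaker: text` line layout is an assumption about how a multi-speaker transcript might be written. Only the bracketed tag syntax comes from the list above. It produces a plain string that could then be passed to whichever client is used for generation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Turn:
    """One line of dialogue (hypothetical helper type, not part of any SDK)."""
    speaker: str
    text: str                      # may itself contain mid-sentence tags such as [laughs]
    tags: Sequence[str] = ()       # leading style tags, e.g. ("whispers",)


def render_tags(tags: Sequence[str]) -> str:
    """Wrap each tag in square brackets and join them with spaces."""
    return " ".join(f"[{t}]" for t in tags)


def render_turn(turn: Turn) -> str:
    """Render a single turn as 'Speaker: [tag] text' (assumed layout)."""
    prefix = render_tags(turn.tags)
    body = f"{prefix} {turn.text}" if prefix else turn.text
    return f"{turn.speaker}: {body}"


def render_dialogue(turns: Sequence[Turn]) -> str:
    """Join all turns into one newline-separated transcript string."""
    return "\n".join(render_turn(t) for t in turns)


if __name__ == "__main__":
    dialogue = [
        Turn("Ana", "Did you hear that?", tags=("whispers",)),
        # Combined leading tags plus a tag placed mid-sentence.
        Turn("Ben", "It was the cat. [laughs] Again.", tags=("excited", "very fast")),
    ]
    print(render_dialogue(dialogue))
```

Keeping tags as data rather than hand-typed brackets makes it easier to swap styles per turn while experimenting, which fits the "no exhaustive list" nature of the tags.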
## Availability (April 15, 2026)
- **Developers**: preview via Gemini API and [[Google AI Studio]] (configurable controls with exportable API code).
- **Enterprises**: preview on Google Vertex AI.
- **Consumers**: available via Google Vids for Workspace users.
## Notable Observations
- [[Simon Willison]] flagged the prompting guide as "surprising": effective prompts can span hundreds of words, specifying accent, emotional shading, and even "the grin in the audio".
- Accent control is prompt-driven; switching between UK regions (London, Newcastle, Exeter) in the same base prompt produces distinct regional deliveries.
- Willison built an interactive playground at tools.simonwillison.net for multi-speaker experimentation.
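
The accent observation above amounts to holding a base prompt fixed and varying one slot. A minimal sketch of that idea follows; the prompt wording, `BASE_PROMPT`, and `accent_variants` are illustrative assumptions rather than text from the prompting guide, and real prompts may be far longer. Only the three region names come from the note above.

```python
from typing import Dict, Iterable

# Hypothetical base prompt: everything stays constant except the {region} slot.
BASE_PROMPT = (
    "Read the transcript below in a warm, conversational voice, "
    "with the accent of someone from {region}.\n\n"
    "Transcript:\n{transcript}"
)


def accent_variants(transcript: str, regions: Iterable[str]) -> Dict[str, str]:
    """Return one full prompt per region, keyed by region name."""
    return {
        region: BASE_PROMPT.format(region=region, transcript=transcript)
        for region in regions
    }


if __name__ == "__main__":
    prompts = accent_variants(
        "[excited] The train is finally on time!",
        ["London", "Newcastle", "Exeter"],
    )
    for region, prompt in prompts.items():
        print(f"--- {region} ---")
        print(prompt)
```

Generating the variants from one template keeps every other instruction identical, so any difference in the resulting audio can be attributed to the accent change alone.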
## Why It Matters
Moves TTS from "read this text" to "perform this text". It unlocks voice-first agents, character-driven audio, multi-speaker narration, and podcast-style content without manual voice acting. It also makes plausible audio impersonation easier, a risk that SynthID only partially mitigates.
## References
- Announcement: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/
- Transcript tags docs: https://ai.google.dev/gemini-api/docs/speech-generation#transcript-tags
- Simon Willison's notes: https://simonwillison.net/2026/Apr/15/gemini-31-flash-tts
- Playground: https://tools.simonwillison.net/
## Related
- [[Gemini]]
- [[Gemini 3.1 Flash Live]]
- [[Text-to-Speech (TTS)]]
- [[Google AI Studio]]
- [[Voice Cloning]]
- [[Simon Willison]]