# Gemini 3.1 Flash TTS

Google's latest [[Text-to-Speech (TTS)]] model (April 2026), part of the [[Gemini]] family. Focused on controllability, expressivity, and quality for developers and enterprises building speech applications. Complements [[Gemini 3.1 Flash Live]] (real-time dialogue) with higher-quality, more controllable offline generation.

## Key Capabilities

- **Audio tags for style control**: natural-language inline tags (e.g. `[whispers]`, `[laughs]`, `[excited]`, `[sighs]`, `[sarcastic]`, `[very fast]`) give granular control over delivery, tone, pace, and non-verbal sounds. Tags can be combined and mixed mid-sentence; there is no exhaustive list, and experimentation is encouraged.
- **Creative expressivity**: supports stylistic directives like `[like a cartoon dog]` or `[like dracula]`, plus scene direction and speaker-level audio profiles.
- **Multi-speaker dialogue**: native support for multi-speaker conversations with distinct voices.
- **Broad language coverage**: 70+ languages. For non-English transcripts, English tags are recommended.
- **Quality**: 1,211 Elo on the Artificial Analysis TTS leaderboard; positioned in the attractive quadrant for quality-vs-cost.
- **SynthID watermarking**: all generated audio is invisibly watermarked to flag AI-generated content.

## Availability (April 15, 2026)

- **Developers**: preview via Gemini API and [[Google AI Studio]] (configurable controls with exportable API code).
- **Enterprises**: preview on Google Vertex AI.
- **Consumers**: available via Google Vids for Workspace users.

## Notable Observations

- [[Simon Willison]] flagged the prompting guide as "surprising": effective prompts can span hundreds of words, specifying accent, emotional shading, even "the grin in the audio".
- Accent control is prompt-driven; switching between UK regions (London, Newcastle, Exeter) in the same base prompt produces distinct regional deliveries.
- Willison built an interactive playground at tools.simonwillison.net for multi-speaker experimentation.

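
The inline tags and the prompt-driven accent switching described above can be sketched as plain string construction. This is only an illustration, not the Gemini API: the helper names (`with_tags`, `multi_speaker`, `accent_variants`), the `Speaker: line` transcript layout, and the wording of the base prompt are all assumptions made for this example. Only the bracketed tag syntax and the three UK regions come from the notes above. No request is sent; the resulting strings would still have to be passed to whichever TTS client is in use.

```python
from string import Template

# Hypothetical helpers: only the [tag] syntax is taken from the note.


def with_tags(text: str, *tags: str) -> str:
    """Prefix a line with inline audio tags, e.g. [excited] [very fast]."""
    prefix = " ".join(f"[{tag}]" for tag in tags)
    return f"{prefix} {text}" if prefix else text


def multi_speaker(lines: list[tuple[str, str]]) -> str:
    """Join (speaker, line) pairs into a transcript.

    The "Speaker: line" layout is an assumption for illustration.
    """
    return "\n".join(f"{speaker}: {line}" for speaker, line in lines)


# Assumed base prompt; real prompts can reportedly run to hundreds of words.
BASE_PROMPT = Template(
    "Read the transcript below in a warm $region accent.\n\n$transcript"
)


def accent_variants(transcript: str, regions: list[str]) -> dict[str, str]:
    """Build one prompt per region from the same base prompt."""
    return {
        region: BASE_PROMPT.substitute(region=region, transcript=transcript)
        for region in regions
    }


transcript = multi_speaker(
    [
        ("Ana", with_tags("I can't believe it worked.", "excited", "very fast")),
        ("Ben", with_tags("Keep your voice down.", "whispers")),
    ]
)
prompts = accent_variants(transcript, ["London", "Newcastle", "Exeter"])
print(prompts["Newcastle"])
```

Keeping the transcript and the accent direction separate means the same tagged dialogue can be reused across regional variants, which mirrors the "same base prompt, different region" experiment noted above.
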
## Why It Matters

Moves TTS from "read this text" to "perform this text". Unlocks voice-first agents, character-driven audio, multi-speaker narration, and podcast-style content without manual voice acting; also accelerates plausible audio impersonation (mitigated partially by SynthID).

## References

- Announcement: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/
- Transcript tags docs: https://ai.google.dev/gemini-api/docs/speech-generation#transcript-tags
- Simon Willison's notes: https://simonwillison.net/2026/Apr/15/gemini-31-flash-tts
- Playground: https://tools.simonwillison.net/

## Related

- [[Gemini]]
- [[Gemini 3.1 Flash Live]]
- [[Text-to-Speech (TTS)]]
- [[Google AI Studio]]
- [[Voice Cloning]]
- [[Simon Willison]]