Overview
Google Cloud Text-to-Speech provides high-quality speech synthesis with two service implementations: GoogleTTSService (WebSocket-based) for streaming with the lowest latency, and GoogleHttpTTSService (HTTP-based) for simpler integration. GoogleTTSService is recommended for real-time applications.
- Google TTS API Reference: Pipecat’s API methods for Google Cloud TTS integration
- Example Implementation: Complete example with Chirp 3 HD voice
- Google Cloud Documentation: Official Google Cloud Text-to-Speech documentation
- Voice Gallery: Browse available voices and languages
Installation
To use Google services, install the required dependencies.
Prerequisites
Google Cloud Setup
Before using Google Cloud TTS services, you need:
- Google Cloud Account: Sign up at Google Cloud Console
- Project Setup: Create a project and enable the Text-to-Speech API
- Service Account: Create a service account with TTS permissions
- Authentication: Set up credentials via service account key or Application Default Credentials
Required Environment Variables
- GOOGLE_APPLICATION_CREDENTIALS: Path to your service account key file (recommended)
- Or use Application Default Credentials for cloud deployments
Configuration
GoogleTTSService
Streaming service optimized for Chirp 3 HD and Journey voices.
- JSON string containing Google Cloud service account credentials.
- Path to Google Cloud service account JSON file.
- Google Cloud location for regional endpoint (e.g., "us-central1").
- Google TTS voice identifier. Deprecated in v0.0.105. Use settings=GoogleTTSService.Settings(voice=...) instead.
- Voice cloning key for Chirp 3 custom voices.
- Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.
- Deprecated in v0.0.105. Use settings=GoogleTTSService.Settings(...) instead.
- Runtime-configurable settings. See GoogleTTSService Settings below.
GoogleTTSService Settings
Runtime-configurable settings passed via the settings constructor argument using GoogleTTSService.Settings(...). These can be updated mid-conversation with TTSUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | None | Model identifier. (Inherited.) |
| voice | str | None | Voice identifier. (Inherited.) |
| language | Language \| str | None | Language for synthesis. (Inherited.) |
| speaking_rate | float | NOT_GIVEN | Speaking rate in the range [0.25, 2.0]. |
GoogleHttpTTSService
HTTP service with full SSML support for all voice types.
- JSON string containing Google Cloud service account credentials.
- Path to Google Cloud service account JSON file.
- Google Cloud location for regional endpoint.
- Google TTS voice identifier. Deprecated in v0.0.105. Use settings=GoogleHttpTTSService.Settings(voice=...) instead.
- Output audio sample rate in Hz.
- Deprecated in v0.0.105. Use settings=GoogleHttpTTSService.Settings(...) instead.
- Runtime-configurable settings. See GoogleHttpTTSService Settings below.
GoogleHttpTTSService Settings
Runtime-configurable settings passed via the settings constructor argument using GoogleHttpTTSService.Settings(...). These can be updated mid-conversation with TTSUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | None | Model identifier. (Inherited.) |
| voice | str | None | Voice identifier. (Inherited.) |
| language | Language \| str | None | Language for synthesis. (Inherited.) |
| pitch | str | NOT_GIVEN | Voice pitch adjustment (e.g., "+2st", "-50%"). |
| rate | str | NOT_GIVEN | Speaking rate for SSML prosody (non-Chirp voices, e.g., "slow", "fast", "125%"). |
| speaking_rate | float | NOT_GIVEN | Speaking rate for AudioConfig (Chirp/Journey voices). Range [0.25, 2.0]. |
| volume | str | NOT_GIVEN | Volume adjustment (e.g., "loud", "soft", "+6dB"). |
| emphasis | Literal | NOT_GIVEN | Emphasis level: "strong", "moderate", "reduced", "none". |
| gender | Literal | NOT_GIVEN | Voice gender preference: "male", "female", "neutral". |
| google_style | Literal | NOT_GIVEN | Google-specific voice style: "apologetic", "calm", "empathetic", "firm", "lively". |
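To make the SSML-related settings above more concrete, the following is a minimal sketch of how `rate`, `pitch`, `volume`, and `emphasis` values could be turned into standard SSML markup. The helper `build_ssml` is hypothetical and is not Pipecat's implementation; it only illustrates the `<prosody>` and `<emphasis>` elements these settings correspond to.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, rate=None, pitch=None, volume=None, emphasis=None):
    """Wrap text in SSML prosody/emphasis tags (hypothetical helper)."""
    # Escape &, <, > so arbitrary text stays valid inside the markup.
    body = escape(text)

    # Emphasis wraps the text first, then prosody wraps the result.
    if emphasis is not None:
        body = f"<emphasis level={quoteattr(emphasis)}>{body}</emphasis>"

    # Only emit prosody attributes that were actually provided.
    prosody = {"rate": rate, "pitch": pitch, "volume": volume}
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in prosody.items()
        if value is not None
    )
    if attrs:
        body = f"<prosody{attrs}>{body}</prosody>"

    return f"<speak>{body}</speak>"


ssml = build_ssml("Hi & bye", rate="125%", pitch="+2st")
```

With no optional arguments the helper returns the escaped text wrapped only in `<speak>`, which mirrors the idea that unset (`NOT_GIVEN`) settings add no markup.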
GeminiTTSService
Streaming service using Gemini’s TTS-specific models with natural voice control. Supports two backends: the Google Cloud backend (with prompts for style instructions and multi-speaker support) and the Gemini Developer API (google-genai) backend (simpler API key authentication).
- Gemini TTS model to use. Options: "gemini-3.1-flash-tts-preview", "gemini-2.5-flash-tts", "gemini-2.5-pro-tts". Deprecated in v0.0.105. Use settings=GeminiTTSService.Settings(model=...) instead.
- Google AI API key for authentication with the GenAI backend. When provided, automatically selects the GenAI backend. Alternatively, set the GOOGLE_API_KEY environment variable.
- JSON string containing Google Cloud service account credentials for the Google Cloud backend.
- Path to Google Cloud service account JSON file for the Google Cloud backend.
- Google Cloud location for regional endpoint (Google Cloud backend only).
- Voice name from available Gemini voices (e.g., "Kore", "Charon", "Puck", "Zephyr"). Deprecated in v0.0.105. Use settings=GeminiTTSService.Settings(voice=...) instead.
- Output audio sample rate in Hz. Google TTS outputs at 24kHz; mismatched rates will produce a warning.
- Deprecated in v0.0.105. Use settings=GeminiTTSService.Settings(...) instead.
- Runtime-configurable settings. See GeminiTTSService Settings below.
- Force use of the google-genai backend when True, or the Google Cloud backend when False. If not provided, backend is selected automatically based on whether api_key is passed.
- HTTP client options for the google-genai client. Only applicable when using the GenAI backend.
GeminiTTSService Settings
Runtime-configurable settings passed via the settings constructor argument using GeminiTTSService.Settings(...). These can be updated mid-conversation with TTSUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | None | Model identifier. (Inherited.) |
| voice | str | None | Voice identifier. (Inherited.) |
| language | Language \| str | None | Language for synthesis. (Inherited.) |
| prompt | str | NOT_GIVEN | Style instructions for how to synthesize the content. |
| multi_speaker | bool | NOT_GIVEN | Enable multi-speaker support. |
| speaker_configs | list[dict] | NOT_GIVEN | Speaker configurations for multi-speaker mode. Each dict should have speaker_alias and optionally speaker_id. |
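The shape of `speaker_configs` described in the table can be sketched as follows. The helper `make_speaker_configs` is hypothetical, and using Gemini voice names such as "Kore" as `speaker_id` values is an assumption for illustration only.

```python
def make_speaker_configs(speakers):
    """Build a speaker_configs list from (alias, speaker_id) pairs.

    Hypothetical helper: speaker_alias is always set, speaker_id only
    when a value is supplied, matching the table's description.
    """
    configs = []
    for alias, speaker_id in speakers:
        config = {"speaker_alias": alias}
        if speaker_id is not None:
            config["speaker_id"] = speaker_id
        configs.append(config)
    return configs


# Assumed example: two speakers, the second without an explicit speaker_id.
speaker_configs = make_speaker_configs([("Host", "Kore"), ("Guest", None)])
```

The resulting list would then be passed alongside `multi_speaker=True` in `GeminiTTSService.Settings(...)`.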
Usage
Basic Setup (Streaming)
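A minimal sketch of constructing the streaming service, assuming the class is importable from `pipecat.services.google.tts` and that the constructor accepts `credentials_path` and `settings` as described in the Configuration section. The voice name and the `build_streaming_tts` wrapper are illustrative, not taken from this page.

```python
import os

# Illustrative settings; the voice name is an assumed Chirp 3 HD voice.
STREAMING_SETTINGS = {
    "voice": "en-US-Chirp3-HD-Charon",
    "speaking_rate": 1.1,  # must stay within [0.25, 2.0]
}


def build_streaming_tts():
    """Create a GoogleTTSService instance (sketch, not verified against a release)."""
    # Imported lazily so this module loads even where pipecat is not installed.
    # Assumption: the service is exported from pipecat.services.google.tts.
    from pipecat.services.google.tts import GoogleTTSService

    return GoogleTTSService(
        # Path to the service account key file; None if the variable is unset.
        credentials_path=os.getenv("GOOGLE_APPLICATION_CREDENTIALS"),
        settings=GoogleTTSService.Settings(**STREAMING_SETTINGS),
    )
```

The returned service would typically be placed in a Pipecat pipeline between the LLM and the transport output.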
HTTP Service with SSML
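A comparable sketch for the HTTP service, using the SSML prosody settings from the GoogleHttpTTSService Settings table. The import path, the WaveNet voice name, and the `build_http_tts` wrapper are assumptions for illustration.

```python
import os

# Illustrative SSML-oriented settings for a non-Chirp voice (assumed voice name).
HTTP_SETTINGS = {
    "voice": "en-US-Wavenet-D",
    "rate": "125%",    # SSML prosody rate (string), not speaking_rate
    "pitch": "+2st",
    "volume": "loud",
}


def build_http_tts():
    """Create a GoogleHttpTTSService instance (sketch, not verified against a release)."""
    # Imported lazily so this module loads even where pipecat is not installed.
    # Assumption: the service is exported from pipecat.services.google.tts.
    from pipecat.services.google.tts import GoogleHttpTTSService

    return GoogleHttpTTSService(
        credentials_path=os.getenv("GOOGLE_APPLICATION_CREDENTIALS"),
        settings=GoogleHttpTTSService.Settings(**HTTP_SETTINGS),
    )
```

For Chirp or Journey voices, `rate` would be replaced by the float `speaking_rate`, since those voices do not support SSML.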
Gemini TTS with GenAI Backend (API Key)
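A sketch of selecting the GenAI backend by passing `api_key` explicitly. The import path and the `build_gemini_genai_tts` wrapper are assumptions; the model and voice names come from the options listed in the Configuration section.

```python
import os

# Model and voice taken from the options listed for GeminiTTSService.
GENAI_SETTINGS = {
    "model": "gemini-2.5-flash-tts",
    "voice": "Kore",
}


def build_gemini_genai_tts():
    """Create a GeminiTTSService on the GenAI backend (sketch)."""
    # Imported lazily so this module loads even where pipecat is not installed.
    # Assumption: the service is exported from pipecat.services.google.tts.
    from pipecat.services.google.tts import GeminiTTSService

    return GeminiTTSService(
        # Passing api_key is what selects the GenAI backend automatically.
        # If the variable is unset this is None and no automatic selection happens.
        api_key=os.getenv("GOOGLE_API_KEY"),
        settings=GeminiTTSService.Settings(**GENAI_SETTINGS),
    )
```

Because this backend does not support `prompt` or `multi_speaker`, those settings are deliberately left out here.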
Gemini TTS with Google Cloud Backend (Style Prompt)
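A sketch of the Google Cloud backend with a style prompt, assuming service account credentials are supplied via `credentials_path`. The import path, the prompt text, and the `build_gemini_cloud_tts` wrapper are illustrative assumptions.

```python
import os

# Model and voice from the listed options; the prompt text is illustrative.
CLOUD_SETTINGS = {
    "model": "gemini-2.5-pro-tts",
    "voice": "Charon",
    "prompt": "Speak in a calm, reassuring tone.",
}


def build_gemini_cloud_tts():
    """Create a GeminiTTSService on the Google Cloud backend (sketch)."""
    # Imported lazily so this module loads even where pipecat is not installed.
    # Assumption: the service is exported from pipecat.services.google.tts.
    from pipecat.services.google.tts import GeminiTTSService

    return GeminiTTSService(
        # No api_key is passed, so the Google Cloud backend is used.
        credentials_path=os.getenv("GOOGLE_APPLICATION_CREDENTIALS"),
        settings=GeminiTTSService.Settings(**CLOUD_SETTINGS),
    )
```

The `prompt` setting is only honored on this backend, which is why it appears here and not in the GenAI example.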
Notes
- Streaming vs HTTP: GoogleTTSService uses the streaming API for low latency and only supports Chirp 3 HD and Journey voices. GoogleHttpTTSService supports all Google voices including Standard and WaveNet, with full SSML support.
- Chirp/Journey voices and SSML: Chirp and Journey voices do not support SSML. The HTTP service automatically uses plain text input for these voices.
- Speaking rate: For Chirp and Journey voices, use speaking_rate (float, 0.25-2.0) in settings. For other voices, use rate (string) for SSML prosody control.
- Gemini TTS sample rate: Google TTS always outputs at 24kHz. Setting a different sample rate will produce a warning and may cause audio issues.
- Gemini TTS backends: GeminiTTSService supports two backends:
  - GenAI backend (google-genai): Simpler authentication with API key. Automatically selected when api_key is provided. Does not support prompt or multi_speaker settings.
  - Google Cloud backend: Uses service account credentials. Supports prompt for style instructions and multi_speaker for multi-voice conversations.
- Backend selection: Pass api_key to use the GenAI backend, or credentials/credentials_path for Google Cloud. The GOOGLE_API_KEY environment variable alone does not switch backends; it is only used once the GenAI backend is active. Use use_genai=True to force the GenAI backend explicitly.
- Gemini multi-speaker: Use multi_speaker=True with speaker_configs to generate conversations between multiple voices (Google Cloud backend only). Markup text with speaker aliases to control which voice speaks.
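The backend selection rule described in these notes can be summarized in a few lines. This is a sketch of the documented behavior only; `select_backend` and its return values are hypothetical names, not part of Pipecat.

```python
from typing import Optional


def select_backend(api_key: Optional[str] = None,
                   use_genai: Optional[bool] = None) -> str:
    """Return which backend the documented rules would pick (hypothetical helper)."""
    # An explicit use_genai value always wins, in either direction.
    if use_genai is not None:
        return "genai" if use_genai else "google_cloud"
    # Otherwise the choice depends only on whether api_key was passed.
    # The GOOGLE_API_KEY environment variable is intentionally not consulted here.
    return "genai" if api_key else "google_cloud"
```

For example, `select_backend(api_key="key", use_genai=False)` yields the Google Cloud backend, reflecting that `use_genai` overrides the automatic choice.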