Google Gemini Live
Google Gemini Live provides multimodal large language model capabilities with real-time audio processing, enabling natural voice conversations without separate ASR/TTS components. This page covers integration using the Gemini Developer API, authenticated with a Gemini API key obtained from Google AI Studio.
Enabling MLLM automatically disables ASR, LLM, and TTS since the MLLM handles end-to-end voice processing directly.
Sample configuration
The following example shows a starting mllm parameter configuration you can use when you Start a conversational AI agent.
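As a minimal sketch, the mllm object can be assembled from the fields listed under Key parameters. The Python below is only an illustration: the API key, model identifier, instructions, and greeting text are placeholders, and how the object is embedded in the Start request body is not shown here.

```python
import json

# Placeholder values (assumptions): substitute your own key and model identifier.
GEMINI_API_KEY = "<your_gemini_api_key>"      # generated in Google AI Studio
GEMINI_LIVE_MODEL = "<gemini_live_model_id>"  # placeholder, not a real model name

# Field names come from the Key parameters list; values are illustrative only.
mllm = {
    "enable": True,
    "vendor": "gemini",
    "api_key": GEMINI_API_KEY,
    "params": {
        "model": GEMINI_LIVE_MODEL,
        "instructions": "You are a friendly voice assistant.",
        "voice": "Aoede",
        "transcribe_agent": True,
        "transcribe_user": True,
    },
    "input_modalities": ["audio"],
    "output_modalities": ["audio"],
    "greeting_message": "Hello, how can I help you today?",
}

# Serialize the object so it can be placed in a JSON request body.
mllm_json = json.dumps(mllm, indent=2)
print(mllm_json)
```

Only vendor, api_key, params, and params.model are marked as required; the remaining fields can be dropped or adjusted.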
Turn detection
For a full list of turn_detection parameters, see mllm.turn_detection. The following examples show the supported configurations for Google Gemini Live. To set up turn detection, add a turn_detection block inside the mllm object when you Start a conversational AI agent.
- Server VAD
- Agora VAD
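The two configurations can be sketched as follows. The field names are taken from the turn_detection properties listed under Key parameters; every numeric value is an illustrative assumption rather than a documented default, and with_turn_detection is a hypothetical helper, not part of any SDK.

```python
# Server VAD: detection runs on the vendor side; parameters are passed through.
# All values below are illustrative assumptions, not documented defaults.
server_vad_turn_detection = {
    "mode": "server_vad",
    "server_vad_config": {
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500,
        "start_of_speech_sensitivity": "START_SENSITIVITY_HIGH",
        "end_of_speech_sensitivity": "END_SENSITIVITY_HIGH",
    },
}

# Agora VAD: detection runs on the Agora side.
agora_vad_turn_detection = {
    "mode": "agora_vad",
    "agora_vad_config": {
        "interrupt_duration_ms": 160,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500,
        "threshold": 0.5,
    },
}


def with_turn_detection(mllm: dict, turn_detection: dict) -> dict:
    """Return a copy of an mllm object with a turn_detection block added."""
    return {**mllm, "turn_detection": turn_detection}


# Usage: attach one of the blocks to an existing mllm object.
example = with_turn_detection({"vendor": "gemini"}, server_vad_turn_detection)
print(example["turn_detection"]["mode"])
```

Returning a copy keeps the base mllm object reusable when you want to compare both modes.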
Key parameters
mllm required
- enable boolean nullable
Enables the MLLM module. Replaces the deprecated advanced_features.enable_mllm.
- api_key string required
The Google Gemini API key used to authenticate requests. You can generate an API key in Google AI Studio.
- messages array[object] nullable
An array of conversation history items passed to the model as context. Each item represents a single message in the conversation history.
- params object required
Configuration object for the Gemini Live model.
  - model string required
  The Gemini Live model identifier.
  - instructions string nullable
  System instructions that define the agent's behavior or tone.
  - voice string nullable
  The voice identifier for audio output. For example, Aoede, Puck, Charon, Kore, Fenrir, Leda, Orus, or Zephyr.
  - affective_dialog boolean nullable
  Whether to enable affective dialog, which allows the model to adapt its tone based on the user's emotional cues.
  - proactive_audio boolean nullable
  When enabled, the model may choose not to respond if the user's input does not require a reply, such as background speech or incomplete requests.
  - transcribe_agent boolean nullable
  Whether to transcribe the agent's speech in real time.
  - transcribe_user boolean nullable
  Whether to transcribe the user's speech in real time.
  - http_options object nullable
  HTTP request options for the Gemini Live API.
    - api_version string nullable
    The API version to use. For example, v1beta.
- turn_detection object nullable
Turn detection configuration for the MLLM module.
Info: When mllm.turn_detection is defined, the top-level turn_detection object has no effect.
  - mode string nullable
  Possible values: agora_vad, server_vad, semantic_vad
  agora_vad: Agora VAD-based detection.
  server_vad: Vendor-side VAD-based detection.
  semantic_vad: Semantic-based detection.
  - agora_vad_config object nullable
  Configuration for Agora VAD-based turn detection. Applicable when mode is agora_vad.
    - interrupt_duration_ms integer nullable
    Minimum duration of speech in milliseconds required to trigger an interruption.
    - prefix_padding_ms integer nullable
    Duration of audio in milliseconds to include before the detected speech start.
    - silence_duration_ms integer nullable
    Duration of silence in milliseconds required to determine end of speech.
    - threshold number nullable
    VAD sensitivity threshold. A higher value reduces false positives.
  - server_vad_config object nullable
  Configuration for vendor-side VAD-based turn detection. Applicable when mode is server_vad. Parameters are passed through to the vendor.
    - prefix_padding_ms integer nullable
    Duration of audio in milliseconds to include before the detected speech start.
    - silence_duration_ms integer nullable
    Duration of silence in milliseconds required to determine end of speech.
    - start_of_speech_sensitivity string nullable
    Possible values: START_SENSITIVITY_HIGH, START_SENSITIVITY_LOW
    Sensitivity for start of speech detection.
    - end_of_speech_sensitivity string nullable
    Possible values: END_SENSITIVITY_HIGH, END_SENSITIVITY_LOW
    Sensitivity for end of speech detection.
- input_modalities array[string] nullable
Default: ["audio"]
Input modalities for the MLLM.
["audio"]: Audio-only input
["audio", "text"]: Accept both audio and text input
- output_modalities array[string] nullable
Default: ["audio"]
Output modalities for the MLLM.
["audio"]: Audio-only output
["text", "audio"]: Combined text and audio output
- greeting_message string nullable
The message the agent speaks when a user joins the channel.
- failure_message string nullable
The message the agent speaks when an error occurs.
- vendor string required
The MLLM provider identifier. Set to "gemini" to use Google Gemini Live with the Gemini Developer API.
For a comprehensive API reference, real-time capabilities, and detailed parameter descriptions, see the Google Gemini Live API documentation.
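The required fields and enumerated values listed under Key parameters can be checked on the client before a request is sent. The following is a sketch under that assumption only: validate_mllm is a hypothetical helper, and the service's own validation may be stricter or differ in detail.

```python
# Enumerated values as listed under Key parameters.
ALLOWED_MODES = {"agora_vad", "server_vad", "semantic_vad"}
START_SENSITIVITIES = {"START_SENSITIVITY_HIGH", "START_SENSITIVITY_LOW"}
END_SENSITIVITIES = {"END_SENSITIVITY_HIGH", "END_SENSITIVITY_LOW"}


def validate_mllm(mllm: dict) -> list:
    """Return a list of problems found in an mllm object (empty if none)."""
    problems = []

    # Fields marked as required in the parameter list.
    for key in ("vendor", "api_key", "params"):
        if not mllm.get(key):
            problems.append(f"missing required field: {key}")
    if mllm.get("vendor") not in (None, "gemini"):
        problems.append('vendor must be "gemini" for Gemini Live')
    params = mllm.get("params") or {}
    if not params.get("model"):
        problems.append("missing required field: params.model")

    # Optional turn_detection block: check enumerated values only.
    turn_detection = mllm.get("turn_detection") or {}
    mode = turn_detection.get("mode")
    if mode is not None and mode not in ALLOWED_MODES:
        problems.append(f"unknown turn_detection.mode: {mode}")
    server = turn_detection.get("server_vad_config") or {}
    start = server.get("start_of_speech_sensitivity")
    if start is not None and start not in START_SENSITIVITIES:
        problems.append(f"unknown start_of_speech_sensitivity: {start}")
    end = server.get("end_of_speech_sensitivity")
    if end is not None and end not in END_SENSITIVITIES:
        problems.append(f"unknown end_of_speech_sensitivity: {end}")

    return problems


# Usage with placeholder values.
config = {"vendor": "gemini", "api_key": "<key>", "params": {"model": "<model>"}}
print(validate_mllm(config))
```

Collecting all problems in a list, instead of raising on the first one, makes it easier to report every issue in a configuration at once.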