Google Gemini Live

Google Gemini Live provides multimodal large language model capabilities with real-time audio processing, enabling natural voice conversations without separate ASR/TTS components. This page covers integration using the Gemini Developer API, authenticated with a Gemini API key obtained from Google AI Studio.

info

Enabling MLLM automatically disables ASR, LLM, and TTS since the MLLM handles end-to-end voice processing directly.

Sample configuration

The following example shows a starting mllm parameter configuration you can use when you Start a conversational AI agent.


    "mllm": {
      "enable": true,
      "api_key": "<GOOGLE_GEMINI_API_KEY>",
      "messages": [
        {
          "role": "user",
          "content": "<HISTORY_CONTENT>"
        }
      ],
      "params": {
        "model": "gemini-3.1-flash-live-preview",
        "instructions": "You are a friendly assistant.",
        "voice": "Charon",
        "affective_dialog": false,
        "proactive_audio": false,
        "transcribe_agent": true,
        "transcribe_user": true,
        "http_options": {
          "api_version": "v1beta"
        }
      },
      "turn_detection": {
        // see details below
      },
      "input_modalities": [
        "audio"
      ],
      "output_modalities": [
        "audio"
      ],
      "greeting_message": "Hi, how can I assist you today?",
      "failure_message": "Sorry, I encountered an issue. Please try again.",
      "vendor": "gemini"
    }
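
If you assemble the request body in code, the sample configuration can be built as a plain dictionary and serialized to JSON. The following is a minimal sketch, not an official client: the helper name build_mllm_config and the GEMINI_API_KEY environment variable are assumptions made for illustration, and the turn_detection block is left out because it is optional.

```python
import json
import os


def build_mllm_config(api_key, instructions="You are a friendly assistant.", history=None):
    """Assemble an mllm block mirroring the sample configuration (hypothetical helper)."""
    if not api_key:
        raise ValueError("api_key is required")
    return {
        "enable": True,
        "api_key": api_key,
        # Conversation history items; each needs a role and content.
        "messages": list(history or []),
        "params": {
            "model": "gemini-3.1-flash-live-preview",
            "instructions": instructions,
            "voice": "Charon",
            "affective_dialog": False,
            "proactive_audio": False,
            "transcribe_agent": True,
            "transcribe_user": True,
            "http_options": {"api_version": "v1beta"},
        },
        "input_modalities": ["audio"],
        "output_modalities": ["audio"],
        "greeting_message": "Hi, how can I assist you today?",
        "failure_message": "Sorry, I encountered an issue. Please try again.",
        "vendor": "gemini",
    }


if __name__ == "__main__":
    # GEMINI_API_KEY is an assumed environment variable name, not an API requirement.
    key = os.environ.get("GEMINI_API_KEY", "<GOOGLE_GEMINI_API_KEY>")
    config = build_mllm_config(key, history=[{"role": "user", "content": "<HISTORY_CONTENT>"}])
    print(json.dumps({"mllm": config}, indent=2))
```

Reading the key from the environment keeps it out of source control; any other secret store would work equally well.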

Turn detection

For a full list of turn_detection parameters, see mllm.turn_detection. The following examples show the supported configurations for Google Gemini Live. To set up turn detection, add a turn_detection block inside the mllm object when you Start a conversational AI agent.

  • Server VAD


    "turn_detection": {
      "mode": "server_vad",
      "server_vad_config": {
        "prefix_padding_ms": 800,
        "silence_duration_ms": 640,
        "start_of_speech_sensitivity": "START_SENSITIVITY_HIGH",
        "end_of_speech_sensitivity": "END_SENSITIVITY_HIGH"
      }
    }

  • Agora VAD


    "turn_detection": {
      "mode": "agora_vad",
      "agora_vad_config": {
        "interrupt_duration_ms": 160,
        "prefix_padding_ms": 800,
        "silence_duration_ms": 640,
        "threshold": 0.5
      }
    }
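
Both examples share one shape: a mode string plus a config object whose key is derived from the mode. The sketch below captures that pattern in a small helper. The name build_turn_detection is hypothetical, and limiting it to the two modes shown above is a choice made for this sketch rather than a documented restriction.

```python
# Maps each mode shown in the examples to the key of its config object.
SUPPORTED_MODES = {
    "server_vad": "server_vad_config",
    "agora_vad": "agora_vad_config",
}


def build_turn_detection(mode, **settings):
    """Return a turn_detection block for the given mode (hypothetical helper)."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    return {"mode": mode, SUPPORTED_MODES[mode]: dict(settings)}


server_vad = build_turn_detection(
    "server_vad",
    prefix_padding_ms=800,
    silence_duration_ms=640,
    start_of_speech_sensitivity="START_SENSITIVITY_HIGH",
    end_of_speech_sensitivity="END_SENSITIVITY_HIGH",
)

agora_vad = build_turn_detection(
    "agora_vad",
    interrupt_duration_ms=160,
    prefix_padding_ms=800,
    silence_duration_ms=640,
    threshold=0.5,
)
```

Either result can be assigned to the turn_detection key of the mllm object before the request body is serialized.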

Key parameters

mllm (required)
  • enable (boolean, nullable)

    Enables the MLLM module. Replaces the deprecated advanced_features.enable_mllm.

  • api_key (string, required)

    The Google Gemini API key used to authenticate requests. You can generate an API key in Google AI Studio.

  • messages (array[object], nullable)

    An array of conversation history items passed to the model as context. Each item represents a single message in the conversation history.

    • role (string, required)

      The role of the message author. For example, user.

    • content (string, required)

      The content of the message.

  • params (object, required)

    Configuration object for the Gemini Live model.

    • model (string, required)

      The Gemini Live model identifier.

    • instructions (string, nullable)

      System instructions that define the agent's behavior or tone.

    • voice (string, nullable)

      The voice identifier for audio output. For example, Aoede, Puck, Charon, Kore, Fenrir, Leda, Orus, or Zephyr.

    • affective_dialog (boolean, nullable)

      Whether to enable affective dialog, which allows the model to adapt its tone based on the user's emotional cues.

    • proactive_audio (boolean, nullable)

      When enabled, the model may choose not to respond if the user's input does not require a reply, such as background speech or incomplete requests.

    • transcribe_agent (boolean, nullable)

      Whether to transcribe the agent's speech in real time.

    • transcribe_user (boolean, nullable)

      Whether to transcribe the user's speech in real time.

    • http_options (object, nullable)

      HTTP request options for the Gemini Live API.

      • api_version (string, nullable)

        The API version to use. For example, v1beta.

  • turn_detection (object, nullable)

    Turn detection configuration for the MLLM module.

    info

    When mllm.turn_detection is defined, the top-level turn_detection object has no effect.

    • mode (string, nullable)

      Possible values: agora_vad, server_vad, semantic_vad

      • agora_vad: Agora VAD-based detection.
      • server_vad: Vendor-side VAD-based detection.
      • semantic_vad: Semantic-based detection.
    • agora_vad_config (object, nullable)

      Configuration for Agora VAD-based turn detection. Applicable when mode is agora_vad.

      • interrupt_duration_ms (integer, nullable)

        Minimum duration of speech in milliseconds required to trigger an interruption.

      • prefix_padding_ms (integer, nullable)

        Duration of audio in milliseconds to include before the detected speech start.

      • silence_duration_ms (integer, nullable)

        Duration of silence in milliseconds required to determine end of speech.

      • threshold (number, nullable)

        VAD sensitivity threshold. A higher value reduces false positives.

    • server_vad_config (object, nullable)

      Configuration for vendor-side VAD-based turn detection. Applicable when mode is server_vad. Parameters are passed through to the vendor.

      • prefix_padding_ms (integer, nullable)

        Duration of audio in milliseconds to include before the detected speech start.

      • silence_duration_ms (integer, nullable)

        Duration of silence in milliseconds required to determine end of speech.

      • start_of_speech_sensitivity (string, nullable)

        Possible values: START_SENSITIVITY_HIGH, START_SENSITIVITY_LOW

        Sensitivity for start of speech detection.

      • end_of_speech_sensitivity (string, nullable)

        Possible values: END_SENSITIVITY_HIGH, END_SENSITIVITY_LOW

        Sensitivity for end of speech detection.

  • input_modalities (array[string], nullable)

    Default: ["audio"]

    Input modalities for the MLLM.

    • ["audio"]: Audio-only input
    • ["audio", "text"]: Accept both audio and text input
  • output_modalities (array[string], nullable)

    Default: ["audio"]

    Output modalities for the MLLM.

    • ["audio"]: Audio-only output
    • ["text", "audio"]: Combined text and audio output
  • greeting_message (string, nullable)

    The message the agent speaks when a user joins the channel.

  • failure_message (string, nullable)

    The message the agent speaks when an error occurs.

  • vendor (string, required)

    The MLLM provider identifier. Set to "gemini" to use Google Gemini Live with the Gemini Developer API.
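
The required fields listed above can be checked on the client before a request is sent. The following is a rough sketch under stated assumptions: validate_mllm is a hypothetical helper, it covers only the fields marked required and the modality values named in this section, and it does not replace validation performed by the service.

```python
# Fields marked "required" directly under mllm in the parameter list.
REQUIRED_TOP_LEVEL = ("api_key", "params", "vendor")
# Modality values that appear in the documented examples.
MODALITY_VALUES = {"audio", "text"}


def validate_mllm(mllm):
    """Return a list of problems found in an mllm block; an empty list means none were found."""
    problems = []
    for field in REQUIRED_TOP_LEVEL:
        if not mllm.get(field):
            problems.append(f"missing required field: {field}")

    params = mllm.get("params") or {}
    if not params.get("model"):
        problems.append("missing required field: params.model")

    # Each history item needs both a role and content.
    for index, message in enumerate(mllm.get("messages") or []):
        for field in ("role", "content"):
            if field not in message:
                problems.append(f"messages[{index}] missing required field: {field}")

    # Both modality lists default to ["audio"] when omitted.
    for key in ("input_modalities", "output_modalities"):
        unknown = set(mllm.get(key) or ["audio"]) - MODALITY_VALUES
        if unknown:
            problems.append(f"{key} has unknown values: {sorted(unknown)}")
    return problems


example = {
    "api_key": "<GOOGLE_GEMINI_API_KEY>",
    "params": {"model": "gemini-3.1-flash-live-preview"},
    "vendor": "gemini",
}
issues = validate_mllm(example)
```

Returning a list instead of raising on the first problem lets a caller report every issue in one pass.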

For a comprehensive API reference, real-time capabilities, and detailed parameter descriptions, see the Google Gemini Live API documentation.