Gemini Live API Dev

Use this skill when building real-time, bidirectional streaming applications with the Gemini Live API. Covers WebSocket-based audio, video, and text streaming, session management, function calling, and ephemeral token authentication.

Category: design Source: google-gemini/gemini-skills

What Is This?

Overview

The Gemini Live API is a WebSocket-based interface provided by Google that enables real-time, bidirectional streaming communication with Gemini AI models. Unlike standard request-response API patterns, the Live API maintains a persistent connection that allows continuous exchange of audio, video, and text data between a client application and the model. This architecture makes it possible to build interactive, low-latency experiences where the model can listen, respond, and react dynamically as input streams in.

Developers working with the Live API interact with it through two primary SDKs: the Python google-genai library and the JavaScript/TypeScript @google/genai package. Both SDKs abstract the underlying WebSocket protocol while exposing the full range of Live API capabilities, including voice activity detection, function calling, session management, and native audio output. The API is designed for production-grade applications that require sustained, real-time AI interaction rather than isolated query-and-response cycles.

This skill covers the complete development workflow for building Live API applications, from establishing initial WebSocket sessions to handling advanced configuration options such as ephemeral tokens for secure client-side authentication. It addresses the practical patterns and pitfalls developers encounter when working with streaming AI interfaces at scale.

Who Should Use This

  • Backend engineers building voice assistant services or real-time transcription pipelines that require persistent AI model connections
  • Frontend developers creating browser-based conversational interfaces using the JavaScript/TypeScript SDK with audio and video input
  • Mobile and web application developers who need to integrate live AI responses into user-facing products with minimal latency
  • AI/ML engineers prototyping multimodal streaming applications that combine audio, video, and text inputs simultaneously
  • Platform architects designing secure client-side authentication flows using ephemeral tokens for Live API access
  • Full-stack developers building interactive tutoring, customer support, or accessibility tools that depend on continuous AI interaction

Why Use It?

Problems It Solves

  • Standard REST API calls introduce round-trip latency that makes real-time conversational AI feel unresponsive. The Live API eliminates this by maintaining an open WebSocket connection throughout a session.
  • Managing audio input and output manually requires complex buffering and synchronization logic. The Live API handles voice activity detection natively, reducing the engineering burden on developers.
  • Securely exposing API credentials in client-side applications is a significant security risk. Ephemeral tokens provide a scoped, time-limited authentication mechanism that avoids embedding long-lived credentials in browser or mobile code (see the token-minting sketch after this list).
  • Building function calling into streaming contexts requires careful state management. The Live API provides structured hooks for invoking external tools mid-conversation without interrupting the stream.
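
As a hedged sketch of the ephemeral-token flow flagged above: the code below mints a short-lived token on a trusted server using the Python SDK's auth_tokens.create method, which is served under the v1alpha API surface at the time of writing. The uses and expiry values are illustrative choices, not required defaults; the resulting token.name is handed to the browser or mobile client in place of a long-lived API key.

import datetime
from google import genai

# Token minting runs on a trusted server, never in the browser.
# Ephemeral tokens are served under the v1alpha API surface.
client = genai.Client(http_options={"api_version": "v1alpha"})

now = datetime.datetime.now(tz=datetime.timezone.utc)

token = client.auth_tokens.create(
    config={
        "uses": 1,  # illustrative: valid for a single session
        "expire_time": now + datetime.timedelta(minutes=30),
        "new_session_expire_time": now + datetime.timedelta(minutes=1),
    }
)

# Ship token.name to the client; it connects with this value
# in place of a long-lived API key.
print(token.name)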

Core Highlights

  • Persistent WebSocket connections for low-latency, bidirectional communication
  • Native support for audio, video, and text streaming within a single session
  • Built-in voice activity detection that automatically segments speech input
  • Function calling support for integrating external tools and APIs during live sessions
  • Ephemeral token generation for secure, client-side API authentication
  • Session management controls for configuring context, turn-taking, and timeouts
  • Full SDK support for both Python and JavaScript/TypeScript environments
  • Configurable audio output with native speech synthesis from the model (a combined configuration sketch follows this list)
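
A minimal sketch of how several of these controls combine in one session configuration, assuming the typed config classes exposed in google.genai.types; the "Kore" voice name is an illustrative choice, and field availability can vary by model:

from google import genai
from google.genai import types

client = genai.Client()

# One config object carries the output modality, a system instruction,
# and a prebuilt synthesis voice.
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    system_instruction="You are a concise, friendly voice assistant.",
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
        )
    ),
)

# Passed when opening the session, e.g.:
# async with client.aio.live.connect(model="gemini-2.0-flash-live-001",
#                                    config=config) as session: ...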

How to Use It?

Basic Usage

To establish a Live API session in Python, use the google-genai SDK as follows:

import asyncio
from google import genai

client = genai.Client()

# Request text output; the Live API expects one response modality per session.
config = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001", config=config
    ) as session:
        # Send a single, complete user turn.
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Hello, can you hear me?"}]},
            turn_complete=True,
        )
        # Stream the model's reply as it arrives.
        async for response in session.receive():
            if response.text:
                print(response.text, end="")

asyncio.run(main())

In JavaScript/TypeScript, the equivalent connection looks like this:

import { GoogleGenAI, Modality } from "@google/genai";

const client = new GoogleGenAI({ apiKey: process.env.API_KEY });

const session = await client.live.connect({
  model: "gemini-2.0-flash-live-001",
  // Request text output; the Live API expects one response modality per session.
  config: { responseModalities: [Modality.TEXT] },
  callbacks: {
    onmessage: (message) => console.log(message),
    onerror: (e) => console.error(e.message),
  },
});

// Send a single, complete user turn.
session.sendClientContent({ turns: "Hello from the browser." });

Specific Scenarios

Scenario 1: Streaming audio input with voice activity detection. When building a voice assistant, send raw audio chunks to the session with send_realtime_input and tune VAD sensitivity in the session configuration. The model automatically detects speech boundaries and responds at natural turn endings, as in the sketch below.
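
The sketch below streams 16-bit PCM audio at 16 kHz, the Live API's documented input format. The VAD sensitivity enum values are assumptions drawn from the v1alpha API surface and may differ in your SDK version, and get_audio_chunks is a hypothetical microphone-capture helper.

import asyncio
from google import genai
from google.genai import types

client = genai.Client()

config = {
    "response_modalities": ["AUDIO"],
    # Tune automatic VAD; these enum values are assumptions and
    # may differ across SDK versions.
    "realtime_input_config": {
        "automatic_activity_detection": {
            "start_of_speech_sensitivity": "START_SENSITIVITY_HIGH",
            "end_of_speech_sensitivity": "END_SENSITIVITY_LOW",
            "silence_duration_ms": 500,
        }
    },
}

async def main():
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001", config=config
    ) as session:
        # get_audio_chunks() is a hypothetical capture helper yielding
        # raw 16-bit PCM frames at 16 kHz.
        for chunk in get_audio_chunks():
            await session.send_realtime_input(
                audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
            )

asyncio.run(main())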

Scenario 2: Function calling during a live session. Register tool definitions in the session configuration. When the model decides a function call is needed, it emits a tool_call message; your application executes the call, returns the result via send_tool_response, and the session continues without interruption, as sketched below.
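
A compact sketch of that loop, assuming a single hypothetical get_weather tool; the stubbed weather result stands in for a real lookup, and the response handling mirrors the SDK's types.FunctionResponse shape.

import asyncio
from google import genai
from google.genai import types

client = genai.Client()

# Register a tool; get_weather is a hypothetical function declaration.
config = {
    "response_modalities": ["TEXT"],
    "tools": [{
        "function_declarations": [{
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "OBJECT",
                "properties": {"city": {"type": "STRING"}},
                "required": ["city"],
            },
        }]
    }],
}

async def main():
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001", config=config
    ) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Weather in Tokyo?"}]},
            turn_complete=True,
        )
        async for response in session.receive():
            if response.tool_call:
                # Answer each requested call; the payload here is stubbed.
                results = [
                    types.FunctionResponse(
                        id=fc.id, name=fc.name, response={"result": "sunny, 22C"}
                    )
                    for fc in response.tool_call.function_calls
                ]
                await session.send_tool_response(function_responses=results)
            elif response.text:
                print(response.text, end="")

asyncio.run(main())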

Real-World Examples

A customer support platform uses the Live API to power a voice agent that listens to callers, queries a CRM via function calling, and responds with synthesized audio, all within a single persistent session.

An online language learning application streams video from a student's camera alongside audio, allowing the model to observe pronunciation and provide real-time feedback.