Feat(ai-gemini): Gemini Realtime Adapter #405
nikas-belogolov wants to merge 8 commits into TanStack:main from
Conversation
📝 Walkthrough

Adds Gemini realtime support: new ai-gemini realtime token/adapter/types, exports, client-side integration in examples, updates to shared realtime APIs (connect signature, session config), and small package.json and openai/elevenlabs adapter adjustments to accept session config.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Browser Client
    participant Server as App Server
    participant GeminiAuth as Google GenAI Auth
    participant GeminiLive as Google GenAI Live
    participant AudioHW as Microphone/Audio
    Client->>Server: request realtime token (provider=gemini, options)
    Server->>GeminiAuth: client.authTokens.create(...) (model, expireTime, modalities)
    GeminiAuth-->>Server: ephemeral token (name, expireTime)
    Server-->>Client: RealtimeToken (token, expiresAt, config)
    Client->>GeminiLive: ai.live.connect(token, RealtimeSessionConfig)
    GeminiLive-->>Client: session open / status events
    AudioHW->>Client: capture PCM frames (AudioWorklet)
    Client->>GeminiLive: sendRealtimeInput({ audio: { data, mimeType } })
    GeminiLive-->>Client: transcript / tool_call / message parts (events)
    Client->>Client: assemble message, emit message_complete
    GeminiLive-->>Client: binary audio chunks
    Client->>AudioHW: decode & schedule playback (AudioBufferSourceNode)
    Client->>GeminiLive: sendText / sendImage / sendToolResult / interrupt
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
There should be some discussion around how to implement the websocket connection: it could be a direct client-server connection (straight to the realtime API), or client-server-server, with the web server (e.g. TanStack Start) acting as a proxy.
| Command | Status | Duration | Result |
|---|---|---|---|
| `nx affected --targets=test:sherif,test:knip,tes...` | ❌ Failed | 2m 47s | View ↗ |
| `nx run-many --targets=build --exclude=examples/**` | ❌ Failed | 48s | View ↗ |
☁️ Nx Cloud last updated this comment at 2026-03-29 10:28:30 UTC
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/typescript/ai-client/src/realtime-client.ts (1)
532-540: ⚠️ Potential issue | 🟡 Minor

Inconsistent tool mapping: `applySessionConfig` omits `outputSchema`. The tool mapping in `connect()` (lines 104-114) includes both `inputSchema` and `outputSchema`, but `applySessionConfig()` only includes `inputSchema`. This inconsistency could lead to missing output schemas when the session is updated after initial connection.

🐛 Proposed fix for consistency
```diff
 const toolsConfig = tools
   ? Array.from(this.clientTools.values()).map((t) => ({
       name: t.name,
       description: t.description,
       inputSchema: t.inputSchema
         ? convertSchemaToJsonSchema(t.inputSchema)
         : undefined,
+      outputSchema: t.outputSchema
+        ? convertSchemaToJsonSchema(t.outputSchema)
+        : undefined,
     }))
   : undefined
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/typescript/ai-client/src/realtime-client.ts` around lines 532 - 540, The applySessionConfig mapping for tools omits outputSchema, causing inconsistency with connect(); update applySessionConfig (the code building toolsConfig from this.clientTools.values()) to include outputSchema the same way inputSchema is handled by calling convertSchemaToJsonSchema on t.outputSchema when present, so toolsConfig contains both inputSchema and outputSchema (mirroring the mapping in connect()).
🧹 Nitpick comments (3)
packages/typescript/ai-client/src/realtime-client.ts (1)
99-102: Remove commented-out code. This dead code should be removed to keep the codebase clean.
🧹 Proposed removal
```diff
-    // const toolsList =
-    //   this.clientTools.size > 0
-    //     ? Array.from(this.clientTools.values())
-    //     : undefined
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/typescript/ai-client/src/realtime-client.ts` around lines 99 - 102, Remove the dead/commented-out block that defines toolsList in realtime-client.ts (the lines referencing toolsList and this.clientTools) — delete the commented code entirely so no leftover commented declarations remain; ensure there are no other references to the removed snippet (search for toolsList and usages of this.clientTools in the surrounding code) and run lint/format to keep the file clean.

packages/typescript/ai-gemini/src/index.ts (1)
86-90: Consider exporting Gemini realtime types from the main entry point. The `./realtime/index` module exports types (`GeminiRealtimeModel`, `GeminiRealtimeTokenOptions`, `GeminiRealtimeOptions`) that aren't re-exported here. Other adapters (text, summarize, image, tts) export their configuration types from the main entry point for consumer convenience.

♻️ Proposed addition for type exports
```diff
 // Realtime adapter
 export {
   geminiRealtime,
   geminiRealtimeToken,
 } from './realtime/index'
+export type {
+  GeminiRealtimeModel,
+  GeminiRealtimeTokenOptions,
+  GeminiRealtimeOptions,
+} from './realtime/index'
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/typescript/ai-gemini/src/index.ts` around lines 86 - 90, Export the realtime types from the main entry point so consumers can import them alongside geminiRealtime and geminiRealtimeToken; add a type-only re-export like "export type { GeminiRealtimeModel, GeminiRealtimeTokenOptions, GeminiRealtimeOptions } from './realtime/index'" in the same file that currently exports geminiRealtime and geminiRealtimeToken. Ensure you use a type-only export to avoid runtime bundle changes and reference the exact type names GeminiRealtimeModel, GeminiRealtimeTokenOptions, and GeminiRealtimeOptions so IDEs and consumers can import them directly.

examples/ts-react-chat/src/routes/realtime.tsx (1)
279-293: Consider enabling additional configuration options for Gemini. The tools indicator now correctly appears for Gemini. However, other session configuration options (output mode, temperature, semantic eagerness) remain OpenAI-only in the UI, but the underlying `RealtimeSessionConfig` supports these for all providers. Consider enabling some of these controls for Gemini if the Gemini Live API supports them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/ts-react-chat/src/routes/realtime.tsx` around lines 279 - 293, The UI currently gates several session configuration controls to provider === 'openai'; update those conditionals to also include 'gemini' so that controls for output mode, temperature, and semanticEagerness are shown when provider === 'gemini' (same as how you added the tools indicator for Gemini). Locate the conditional checks and UI blocks in the realtime component that reference provider (and the controls bound to RealtimeSessionConfig such as outputMode, temperature, semanticEagerness) and extend their logic to allow 'gemini'; ensure the form bindings still map to RealtimeSessionConfig fields so changes propagate to the session payload.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/ts-react-chat/src/lib/use-realtime.ts`:
- Around line 65-69: The Gemini branch is using shared OpenAI defaults (voice:
'alloy' and unconstrained responseModalities) which Gemini Live rejects; update
the geminiRealtime() path so it supplies Gemini-compatible defaults: set voice
to a valid Gemini prebuilt name (e.g., 'Puck' or another supported name) instead
of 'alloy', and set responseModalities to an explicit single-element array
(e.g., [Modality.AUDIO]) for the session; ensure geminiRealtime (or its caller
in useRealtime) overrides the shared defaults and enforces only one modality so
the Gemini connection will succeed.
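The defaults the prompt above describes can be sketched as a small helper. All names and shapes here are hypothetical stand-ins, not the actual use-realtime.ts API:

```typescript
// Sketch of Gemini-compatible session defaults. Gemini Live rejects the
// OpenAI-style shared defaults ('alloy' voice, unconstrained modalities),
// so the gemini branch supplies its own.
type SessionDefaults = {
  voice: string
  responseModalities: Array<'AUDIO' | 'TEXT'>
}

function geminiSessionDefaults(
  overrides: Partial<SessionDefaults> = {},
): SessionDefaults {
  return {
    // 'Puck' is one of Gemini's prebuilt voices; 'alloy' is OpenAI-only.
    voice: overrides.voice ?? 'Puck',
    // Gemini Live accepts exactly one response modality per session,
    // so clamp any caller-provided list to a single entry.
    responseModalities: overrides.responseModalities?.slice(0, 1) ?? ['AUDIO'],
  }
}
```

The clamp to one modality encodes the review's constraint directly, so a caller passing two modalities still produces a valid session payload.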
In `@packages/typescript/ai-client/src/realtime-types.ts`:
- Around line 28-34: The ElevenLabs realtime adapter's connect function still
uses the old signature with clientToolDefs?: ReadonlyArray<AnyClientTool>, which
no longer matches the RealtimeAdapter.connect signature (token: RealtimeToken,
config: RealtimeSessionConfig) and causes type errors; update the connect method
in packages/typescript/ai-elevenlabs/src/realtime/adapter.ts (the function named
connect or the class implementing RealtimeAdapter) to accept (token:
RealtimeToken, config: RealtimeSessionConfig) and remove the legacy
clientToolDefs parameter, update any internal usages/calls within that adapter
to read configuration from the provided RealtimeSessionConfig, and ensure the
exported adapter type still satisfies RealtimeAdapter.
In `@packages/typescript/ai-gemini/src/realtime/adapter.ts`:
- Around line 96-112: liveConfig is missing transcription flags so Gemini won't
populate inputTranscription/outputTranscription; update the LiveConnectConfig
object built in the liveConfig variable to include inputAudioTranscription: {}
and outputAudioTranscription: {} (use the same config.providerOptions merge
pattern so you don't overwrite existing keys) so the API will emit transcription
data that the adapter reads when handling inputTranscription/outputTranscription
events.
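The merge pattern the prompt above asks for can be sketched as follows, assuming providerOptions is a plain object (the real `LiveConnectConfig` type is richer):

```typescript
// Sketch: enable transcription flags without clobbering any keys the
// caller already set via providerOptions.
type LiveConnectConfigSketch = Record<string, unknown>

function withTranscription(
  providerOptions: LiveConnectConfigSketch = {},
): LiveConnectConfigSketch {
  return {
    inputAudioTranscription: {},
    outputAudioTranscription: {},
    // Spread last so caller-supplied keys win, matching the merge pattern
    // the review describes.
    ...providerOptions,
  }
}
```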
- Around line 466-473: The sendImage function currently calls
session.sendRealtimeInput with a non-supported media property; update sendImage
to pass the image under the modality-specific video field instead of media or
deprecated mediaChunks (e.g., call session.sendRealtimeInput({ video: { /*
include the image bytes and mimeType as the frame payload */ } })). Locate
sendImage and replace the media object with a video object containing the image
data and mimeType in the API's expected frame/payload shape so the
session.sendRealtimeInput call uses video rather than media or mediaChunks.
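The payload shape the prompt above describes can be sketched like this; the exact frame fields accepted under the `video` input are an assumption, not a confirmed SDK contract:

```typescript
// Hypothetical sketch of the suggested sendImage payload: the image rides
// in the modality-specific `video` field as a single frame, instead of the
// unsupported `media` property or deprecated `mediaChunks`.
function buildImageFrame(base64Data: string, mimeType: string) {
  return {
    video: {
      data: base64Data, // base64-encoded image bytes
      mimeType, // e.g. 'image/jpeg'
    },
  }
}
```

The adapter would then pass this object to `session.sendRealtimeInput(...)` in place of the current `media`-based payload.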
- Around line 179-185: The convertFloat32ToInt16 function currently returns
buf.toString() (comma-separated integers); change it to produce a base64-encoded
string of the raw 16-bit PCM bytes suitable for the Gemini Live API audio.data
field (e.g., MIME audio/pcm;rate=16000). Convert the Float32Array to an
Int16Array (clamping and scaling as done now), then create a Uint8Array view
over the Int16Array's buffer (ensuring correct endianness), encode that byte
array to base64, and return the base64 string so audio.data contains raw PCM
bytes rather than CSV integers.
- Around line 188-190: The GoogleGenAI client is being constructed with an
ephemeral token (token.token) but missing the v1alpha API version; update the
initialization of GoogleGenAI (the ai instance) to include httpOptions: {
apiVersion: 'v1alpha' } when using ephemeral tokens so Live API requests use the
v1alpha endpoint, e.g., detect where new GoogleGenAI({ apiKey: token.token }) is
created and add the httpOptions.apiVersion flag.
In `@packages/typescript/ai-gemini/src/realtime/token.ts`:
- Around line 41-42: The default model string is invalid for the
GeminiRealtimeModel union: change the default assigned to the local variable
model (currently set via options.model ?? 'gemini-live-2.5-flash-native-audio')
to one of the allowed union values (e.g., 'gemini-3.1-flash-live-preview' or
'gemini-2.5-flash-native-audio-preview-12-2025') so the inferred type matches
GeminiRealtimeModel; update the expression using options.model to fall back to a
valid GeminiRealtimeModel literal.
---
Outside diff comments:
In `@packages/typescript/ai-client/src/realtime-client.ts`:
- Around line 532-540: The applySessionConfig mapping for tools omits
outputSchema, causing inconsistency with connect(); update applySessionConfig
(the code building toolsConfig from this.clientTools.values()) to include
outputSchema the same way inputSchema is handled by calling
convertSchemaToJsonSchema on t.outputSchema when present, so toolsConfig
contains both inputSchema and outputSchema (mirroring the mapping in connect()).
---
Nitpick comments:
In `@examples/ts-react-chat/src/routes/realtime.tsx`:
- Around line 279-293: The UI currently gates several session configuration
controls to provider === 'openai'; update those conditionals to also include
'gemini' so that controls for output mode, temperature, and semanticEagerness
are shown when provider === 'gemini' (same as how you added the tools indicator
for Gemini). Locate the conditional checks and UI blocks in the realtime
component that reference provider (and the controls bound to
RealtimeSessionConfig such as outputMode, temperature, semanticEagerness) and
extend their logic to allow 'gemini'; ensure the form bindings still map to
RealtimeSessionConfig fields so changes propagate to the session payload.
In `@packages/typescript/ai-client/src/realtime-client.ts`:
- Around line 99-102: Remove the dead/commented-out block that defines toolsList
in realtime-client.ts (the lines referencing toolsList and this.clientTools) —
delete the commented code entirely so no leftover commented declarations remain;
ensure there are no other references to the removed snippet (search for
toolsList and usages of this.clientTools in the surrounding code) and run
lint/format to keep the file clean.
In `@packages/typescript/ai-gemini/src/index.ts`:
- Around line 86-90: Export the realtime types from the main entry point so
consumers can import them alongside geminiRealtime and geminiRealtimeToken; add
a type-only re-export like "export type { GeminiRealtimeModel,
GeminiRealtimeTokenOptions, GeminiRealtimeOptions } from './realtime/index'" in
the same file that currently exports geminiRealtime and geminiRealtimeToken.
Ensure you use a type-only export to avoid runtime bundle changes and reference
the exact type names GeminiRealtimeModel, GeminiRealtimeTokenOptions, and
GeminiRealtimeOptions so IDEs and consumers can import them directly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 26ce7bc3-6e10-47ce-adc7-516c0d88c602
⛔ Files ignored due to path filters (1)
`pnpm-lock.yaml` is excluded by `!**/pnpm-lock.yaml`
📒 Files selected for processing (13)
- .changeset/huge-lizards-admire.md
- examples/ts-react-chat/src/lib/use-realtime.ts
- examples/ts-react-chat/src/routes/realtime.tsx
- packages/typescript/ai-client/src/realtime-client.ts
- packages/typescript/ai-client/src/realtime-types.ts
- packages/typescript/ai-gemini/package.json
- packages/typescript/ai-gemini/src/index.ts
- packages/typescript/ai-gemini/src/realtime/adapter.ts
- packages/typescript/ai-gemini/src/realtime/index.ts
- packages/typescript/ai-gemini/src/realtime/token.ts
- packages/typescript/ai-gemini/src/realtime/types.ts
- packages/typescript/ai-openai/src/realtime/adapter.ts
- packages/typescript/ai/src/realtime/types.ts
```ts
function convertFloat32ToInt16(buffer: Float32Array) {
  let l = buffer.length;
  const buf = new Int16Array(l);
  while (l--) {
    buf[l] = Math.min(1, Math.max(-1, buffer[l]!)) * 0x7fff;
  }
  return buf.toString();
}
```
🧩 Analysis chain
🌐 Web query:
@google/genai session.sendRealtimeInput audio.data expected format for Gemini Live JavaScript
💡 Result:
The expected format for `audio.data` in `session.sendRealtimeInput({ audio: { data: ..., mimeType: ... } })` for the @google/genai JavaScript SDK (Gemini Live) is a base64-encoded string of raw 16-bit PCM audio data (little-endian, typically 16kHz sample rate). Usage example:

```ts
session.sendRealtimeInput({
  audio: {
    data: audioChunk.toString('base64'), // audioChunk is a Buffer/ArrayBuffer/Uint8Array of raw PCM bytes
    mimeType: 'audio/pcm;rate=16000'
  }
});
```

Key details:
- Audio format: Raw PCM, 16-bit signed integers, little-endian, mono.
- Sample rate: 16kHz native (API resamples if needed).
- mimeType: Must include rate, e.g., 'audio/pcm;rate=16000'.
- Send in small chunks for real-time streaming.

This matches official Google AI for Developers documentation and SDK source patterns.
Citations:
- 1: https://ai.google.dev/gemini-api/docs/live-guide
- 2: https://ai.google.dev/gemini-api/docs/live-api/capabilities
- 3: https://github.com/googleapis/js-genai/blob/89400098/src/live.ts
- 4: https://tessl.io/registry/tessl/npm-google--genai/1.30.0/docs/live.md
🏁 Script executed:
```shell
# First, let's check if the file exists and read the relevant section
cat -n packages/typescript/ai-gemini/src/realtime/adapter.ts | sed -n '170,200p'
```

Repository: TanStack/ai
Length of output: 981
🏁 Script executed:
```shell
# Search for usages of convertFloat32ToInt16 in the codebase
rg "convertFloat32ToInt16" --type ts --type tsx -B 2 -A 5
```

Repository: TanStack/ai
Length of output: 82
🏁 Script executed:
```shell
# Check for any other audio-related functions in the adapter file
rg "audio|pcm|Audio|PCM" packages/typescript/ai-gemini/src/realtime/adapter.ts -n
```

Repository: TanStack/ai
Length of output: 3431
Encode microphone PCM as base64-encoded raw bytes, not comma-separated integers.
Int16Array.prototype.toString() returns a comma-separated decimal string like "12,-44,9", but the Gemini Live API expects the audio.data field to contain a base64-encoded string of raw 16-bit PCM bytes with a MIME type like audio/pcm;rate=16000. Passing the current comma-separated format will cause the audio payload to be rejected or misinterpreted server-side, preventing microphone input from decoding correctly.
🐛 Proposed fix
function convertFloat32ToInt16(buffer: Float32Array) {
- let l = buffer.length;
- const buf = new Int16Array(l);
- while (l--) {
- buf[l] = Math.min(1, Math.max(-1, buffer[l]!)) * 0x7fff;
- }
- return buf.toString();
+ const pcm = new Int16Array(buffer.length)
+ for (let i = 0; i < buffer.length; i++) {
+ pcm[i] = Math.min(1, Math.max(-1, buffer[i]!)) * 0x7fff
+ }
+
+ const bytes = new Uint8Array(pcm.buffer)
+ let binary = ''
+ for (const byte of bytes) {
+ binary += String.fromCharCode(byte)
+ }
+ return btoa(binary)
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/typescript/ai-gemini/src/realtime/adapter.ts` around lines 179 -
185, The convertFloat32ToInt16 function currently returns buf.toString()
(comma-separated integers); change it to produce a base64-encoded string of the
raw 16-bit PCM bytes suitable for the Gemini Live API audio.data field (e.g.,
MIME audio/pcm;rate=16000). Convert the Float32Array to an Int16Array (clamping
and scaling as done now), then create a Uint8Array view over the Int16Array's
buffer (ensuring correct endianness), encode that byte array to base64, and
return the base64 string so audio.data contains raw PCM bytes rather than CSV
integers.
```ts
const ai = new GoogleGenAI({
  apiKey: token.token
});
```
🧩 Analysis chain
🌐 Web query:
Does @google/genai require httpOptions.apiVersion = 'v1alpha' when connecting to Gemini Live with an ephemeral token?
💡 Result:
Yes, @google/genai requires httpOptions.apiVersion = 'v1alpha' when connecting to Gemini Live with an ephemeral token. Ephemeral tokens are exclusively compatible with the Live API and mandate the v1alpha API version. Official documentation explicitly states that when using an ephemeral token as the apiKey for client initialization, the http_options must be set to {'api_version': 'v1alpha'}. This is also required for Live API preview features like affective dialog and proactive audio, and for creating the ephemeral tokens themselves.
Citations:
- 1: https://ai.google.dev/gemini-api/docs/live-api/capabilities
- 2: https://ai.google.dev/gemini-api/docs/live-api/ephemeral-tokens
- 3: https://ai.google.dev/gemini-api/docs/ephemeral-tokens
- 4: https://googleapis-python-genai-70.mintlify.app/guides/http-options
- 5: https://ai.google.dev/gemini-api/docs/api-versions
🏁 Script executed:
```shell
# First, let's check the file structure and the specific lines mentioned
head -n 200 packages/typescript/ai-gemini/src/realtime/adapter.ts | tail -n 20
```

Repository: TanStack/ai
Length of output: 520
🏁 Script executed:
```shell
# Let's also search for other GoogleGenAI instantiations in the codebase
rg "new GoogleGenAI" -A 3 packages/typescript/ai-gemini/
```

Repository: TanStack/ai
Length of output: 873
🏁 Script executed:
```shell
# Check the `@google/genai` version in the project to verify SDK compatibility
cat packages/typescript/ai-gemini/package.json | grep -A 5 -B 5 "@google/genai"
```

Repository: TanStack/ai
Length of output: 264
🏁 Script executed:
```shell
# Check more context around the GoogleGenAI initialization to verify it's for Live API
sed -n '180,210p' packages/typescript/ai-gemini/src/realtime/adapter.ts
```

Repository: TanStack/ai
Length of output: 907
🏁 Script executed:
```shell
# Also check the token.ts file to understand the different use case
cat -n packages/typescript/ai-gemini/src/realtime/token.ts | head -n 50
```

Repository: TanStack/ai
Length of output: 1966
Add v1alpha API version when initializing GoogleGenAI with ephemeral token.
Ephemeral tokens for Gemini Live API are v1alpha-only. When using an ephemeral token as the API key, the client must be initialized with httpOptions: { apiVersion: 'v1alpha' }. Without this configuration, Live API requests will route to the wrong API version and fail.
Proposed fix
```diff
 const ai = new GoogleGenAI({
-  apiKey: token.token
+  apiKey: token.token,
+  httpOptions: {
+    apiVersion: 'v1alpha',
+  },
 });
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
const ai = new GoogleGenAI({
  apiKey: token.token,
  httpOptions: {
    apiVersion: 'v1alpha',
  },
});
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/typescript/ai-gemini/src/realtime/adapter.ts` around lines 188 -
190, The GoogleGenAI client is being constructed with an ephemeral token
(token.token) but missing the v1alpha API version; update the initialization of
GoogleGenAI (the ai instance) to include httpOptions: { apiVersion: 'v1alpha' }
when using ephemeral tokens so Live API requests use the v1alpha endpoint, e.g.,
detect where new GoogleGenAI({ apiKey: token.token }) is created and add the
httpOptions.apiVersion flag.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/typescript/ai-elevenlabs/src/realtime/adapter.ts (1)
35-47: ⚠️ Potential issue | 🟠 Major

Wire `options` and `config` parameters to session initialization. The `connect()` method accepts both `options` (from `elevenlabsRealtime()`) and `_config` (`RealtimeSessionConfig`) but passes neither to `createElevenLabsConnection()`, causing caller-provided configuration to be silently dropped. While `ElevenLabsRealtimeOptions` includes meaningful fields (`connectionMode`, `debug`), and the Gemini adapter demonstrates proper config wiring, the ElevenLabs implementation ignores both parameters. At minimum, rename `_config` to `config` to reflect its intended use and thread it through to session creation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/typescript/ai-elevenlabs/src/realtime/adapter.ts` around lines 35 - 47, The connect implementation of elevenlabsRealtime is dropping the caller-provided ElevenLabsRealtimeOptions and RealtimeSessionConfig; rename the unused parameter _config to config in the connect signature and thread both options (from elevenlabsRealtime) and config (RealtimeSessionConfig) into createElevenLabsConnection so session initialization receives connectionMode/debug and session config; update the call sites inside elevenlabsRealtime.connect to pass (token, config, options, clientToolDefs) or the equivalent parameter order expected by createElevenLabsConnection and adjust createElevenLabsConnection invocation accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@packages/typescript/ai-elevenlabs/src/realtime/adapter.ts`:
- Around line 35-47: The connect implementation of elevenlabsRealtime is
dropping the caller-provided ElevenLabsRealtimeOptions and
RealtimeSessionConfig; rename the unused parameter _config to config in the
connect signature and thread both options (from elevenlabsRealtime) and config
(RealtimeSessionConfig) into createElevenLabsConnection so session
initialization receives connectionMode/debug and session config; update the call
sites inside elevenlabsRealtime.connect to pass (token, config, options,
clientToolDefs) or the equivalent parameter order expected by
createElevenLabsConnection and adjust createElevenLabsConnection invocation
accordingly.
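The rewiring the prompt above describes can be sketched with placeholder types; every name below except the idea of `connect` forwarding its arguments is a hypothetical stand-in for the adapter's real signatures:

```typescript
// Sketch of the fix: `_config` renamed to `config`, and both `options`
// and `config` are threaded through to connection creation instead of
// being silently dropped.
type Token = { token: string }
type SessionConfig = { instructions?: string }
type AdapterOptions = { connectionMode?: 'webrtc' | 'websocket'; debug?: boolean }

function connect(
  token: Token,
  config: SessionConfig, // previously `_config`, ignored
  options: AdapterOptions,
  createConnection: (t: Token, c: SessionConfig, o: AdapterOptions) => string,
): string {
  // Forward everything the caller provided to session creation.
  return createConnection(token, config, options)
}
```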
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e9ecd1b6-aed5-4bb5-81bc-c287adfd2e67
📒 Files selected for processing (5)
- packages/typescript/ai-client/src/realtime-client.ts
- packages/typescript/ai-client/src/realtime-types.ts
- packages/typescript/ai-elevenlabs/src/realtime/adapter.ts
- packages/typescript/ai-gemini/src/realtime/adapter.ts
- packages/typescript/ai-openai/src/realtime/adapter.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- packages/typescript/ai-client/src/realtime-client.ts
- packages/typescript/ai-client/src/realtime-types.ts
- packages/typescript/ai-openai/src/realtime/adapter.ts
Actionable comments posted: 4
♻️ Duplicate comments (2)
packages/typescript/ai-gemini/src/realtime/adapter.ts (2)
218-220: ⚠️ Potential issue | 🟠 Major

Initialize `GoogleGenAI` with `v1alpha` for ephemeral tokens. Gemini’s ephemeral tokens are Live-only and the official docs require the `v1alpha` API version when using them. Without that, the SDK can hit the wrong websocket/API path and fail to connect. (ai.google.dev)

🔧 Proposed fix
```diff
 const ai = new GoogleGenAI({
-  apiKey: token.token
-});
+  apiKey: token.token,
+  httpOptions: {
+    apiVersion: 'v1alpha',
+  },
+})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/typescript/ai-gemini/src/realtime/adapter.ts` around lines 218 - 220, The GoogleGenAI client is initialized without the API version so ephemeral tokens can target the wrong websocket/API path; update the GoogleGenAI instantiation (the new GoogleGenAI({ apiKey: token.token }) call) to include the v1alpha API version required for ephemeral tokens (e.g., add the apiVersion/version option set to "v1alpha") so the SDK uses the correct live-only endpoint when using token.token.
209-215: ⚠️ Potential issue | 🔴 Critical

Encode PCM as base64 bytes, not CSV. `buf.toString()` produces comma-separated integers, but Gemini Live expects `audio.data` to be base64-encoded raw 16-bit PCM bytes. In the current form microphone audio will be rejected or decoded as garbage. (ai.google.dev)

🐛 Proposed fix
```diff
 function convertFloat32ToInt16(buffer: Float32Array) {
-  let l = buffer.length;
-  const buf = new Int16Array(l);
-  while (l--) {
-    buf[l] = Math.min(1, Math.max(-1, buffer[l]!)) * 0x7fff;
-  }
-  return buf.toString();
+  const pcm = new Int16Array(buffer.length)
+  for (let i = 0; i < buffer.length; i++) {
+    pcm[i] = Math.min(1, Math.max(-1, buffer[i]!)) * 0x7fff
+  }
+
+  const bytes = new Uint8Array(pcm.buffer)
+  let binary = ''
+  for (const byte of bytes) {
+    binary += String.fromCharCode(byte)
+  }
+  return btoa(binary)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/typescript/ai-gemini/src/realtime/adapter.ts` around lines 209 - 215, convertFloat32ToInt16 currently returns CSV via buf.toString(); update it to produce base64-encoded raw 16-bit PCM bytes instead: convert the Float32Array samples into an Int16Array (clamped to -1..1 and scaled by 0x7fff) as done in convertFloat32ToInt16, then create a Uint8Array view over the Int16Array.buffer (ensure little-endian PCM ordering) and return a base64 string of those raw bytes (e.g. Buffer.from(uint8Array).toString('base64') in Node or equivalent in browsers); ensure the function still accepts a Float32Array and returns the base64 audio.data string expected by Gemini Live.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/typescript/ai-gemini/src/realtime/adapter.ts`:
- Around line 107-113: The code destructures provider-specific fields from
config.providerOptions without guarding for it; update the destructuring in
adapter.ts so it handles an undefined RealtimeSessionConfig.providerOptions by
using a safe fallback (e.g., null-coalescing or an early guard) before
extracting languageCode, contextWindowCompression, proactivity,
enableAffectiveDialog, and thinkingConfig; ensure the change references
config.providerOptions and GeminiRealtimeProviderOptions and preserves types
while providing sensible defaults or returning early when providerOptions is
absent.
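The guard the prompt above asks for can be sketched as a pure function, with a trimmed, hypothetical stand-in for `GeminiRealtimeProviderOptions`:

```typescript
// Sketch: `providerOptions` is optional on the session config, so fall
// back to an empty object before destructuring instead of throwing on
// the default path.
interface GeminiProviderOptionsSketch {
  languageCode?: string
  enableAffectiveDialog?: boolean
}

function readProviderOptions(config: {
  providerOptions?: GeminiProviderOptionsSketch
}) {
  // `?? {}` keeps the destructure safe when providerOptions is absent.
  const { languageCode, enableAffectiveDialog } = config.providerOptions ?? {}
  return { languageCode, enableAffectiveDialog }
}
```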
- Around line 254-281: The transcription guard incorrectly checks
inputTranscription.finished/outputTranscription.finished (Gemini never sends
`finished`), causing transcript events to be skipped; update the checks to only
verify presence of text (inputTranscription.text/outputTranscription.text) and
use response.serverContent?.turnComplete to determine finality (pass that value
as isFinal), and preserve the currentMode switch to 'thinking' and
emit('mode_change') when appropriate; locate this logic around
inputTranscription/outputTranscription and emit('transcript') to apply the
change.
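The corrected guard logic can be sketched as a pure function; the event shapes below are trimmed to the fields named in the review and are not the full Gemini server-content type:

```typescript
// Sketch: emit a transcript event whenever text is present, and derive
// finality from turnComplete rather than a `finished` flag that Gemini
// never sends.
type ServerContentSketch = {
  inputTranscription?: { text?: string }
  outputTranscription?: { text?: string }
  turnComplete?: boolean
}

type TranscriptEvent = {
  role: 'user' | 'assistant'
  text: string
  isFinal: boolean
}

function transcriptEvents(
  serverContent: ServerContentSketch | undefined,
): Array<TranscriptEvent> {
  const events: Array<TranscriptEvent> = []
  const isFinal = serverContent?.turnComplete ?? false

  const input = serverContent?.inputTranscription?.text
  if (input) events.push({ role: 'user', text: input, isFinal })

  const output = serverContent?.outputTranscription?.text
  if (output) events.push({ role: 'assistant', text: output, isFinal })

  return events
}
```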
- Around line 512-520: The sendToolResult implementation in sendToolResult is
sending functionResponses as a single object but the Gemini Live API expects an
array of FunctionResponse objects; update the call to session.sendToolResponse
so that the functionResponses property is an array (e.g., [ { id: callId,
response: { result } } ]) and adjust typing if necessary to match the
FunctionResponse[] shape used elsewhere.
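The corrected payload shape described above can be sketched as follows, assuming the `FunctionResponse` fields named in the review:

```typescript
// Sketch: the Gemini Live API expects `functionResponses` to be an array
// of FunctionResponse objects, not a single object.
function buildToolResponse(callId: string, result: unknown) {
  return {
    functionResponses: [{ id: callId, response: { result } }],
  }
}
```

The adapter would then pass this object to `session.sendToolResponse(...)` in place of the single-object payload.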
- Around line 242-247: The code is incorrectly encoding response.data as UTF-8;
instead extract the base64 PCM from
response.serverContent.modelTurn.parts[].inlineData.data, decode it to binary
and pass the resulting ArrayBuffer to playIncomingAudioChunk; implement a helper
like base64ToArrayBuffer(base64: string) that uses atob and Uint8Array to
produce an ArrayBuffer, and replace the textEncoder.encode(response.data).buffer
call in the adapter (where playIncomingAudioChunk is invoked) with
base64ToArrayBuffer(response.serverContent.modelTurn.parts[i].inlineData.data)
(ensuring you handle the correct part index and null checks).
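A browser-oriented sketch of the helper the prompt above asks for (`atob` is assumed available, as it is in browsers and recent Node):

```typescript
// Sketch: decode the base64 `inlineData.data` payload into an ArrayBuffer
// suitable for audio playback, replacing the incorrect UTF-8 encoding of
// response.data.
function base64ToArrayBuffer(base64: string): ArrayBuffer {
  const binary = atob(base64) // one character per decoded byte
  const buffer = new ArrayBuffer(binary.length)
  const bytes = new Uint8Array(buffer)
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i)
  }
  return buffer
}
```

The adapter would call this with the part's `inlineData.data` (after null checks) and hand the resulting buffer to `playIncomingAudioChunk`.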
---
Duplicate comments:
In `@packages/typescript/ai-gemini/src/realtime/adapter.ts`:
- Around line 218-220: The GoogleGenAI client is initialized without the API
version so ephemeral tokens can target the wrong websocket/API path; update the
GoogleGenAI instantiation (the new GoogleGenAI({ apiKey: token.token }) call) to
include the v1alpha API version required for ephemeral tokens (e.g., add the
apiVersion/version option set to "v1alpha") so the SDK uses the correct
live-only endpoint when using token.token.
- Around line 209-215: convertFloat32ToInt16 currently returns a
comma-separated string via buf.toString(); update it to produce base64-encoded
raw 16-bit PCM bytes instead: convert the Float32Array samples into an
Int16Array (clamped to -1..1 and scaled by 0x7fff), then create a Uint8Array
view over the Int16Array's buffer (ensuring little-endian PCM ordering) and
return a base64 string of those raw bytes (e.g.
Buffer.from(uint8Array).toString('base64') in Node or equivalent in browsers);
the function should still accept a Float32Array and return the base64
audio.data string expected by Gemini Live.
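For the API-version duplicate comment, a minimal sketch of the client options is below. The `httpOptions.apiVersion` knob is an assumption based on the @google/genai SDK surface, not taken from this PR; verify against the installed SDK version before relying on it.

```typescript
// Sketch only: builds the options object for a GoogleGenAI client that uses an
// ephemeral token. `httpOptions.apiVersion: 'v1alpha'` is assumed to be the SDK
// switch for the live-only endpoint; confirm against the @google/genai docs.
function liveClientOptions(ephemeralToken: string) {
  return {
    apiKey: ephemeralToken,
    httpOptions: { apiVersion: 'v1alpha' },
  }
}
```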
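For the PCM-encoding duplicate comment, the steps above can be sketched as a self-contained conversion. This is a sketch, not the PR's implementation; it uses Node's Buffer for base64 (a browser build would assemble a binary string and use btoa instead).

```typescript
// Convert Float32 samples (-1..1) into base64-encoded raw little-endian
// 16-bit PCM, the shape expected in Gemini Live's audio.data field.
function float32ToBase64Pcm16(samples: Float32Array): string {
  const pcm = new Int16Array(samples.length)
  for (let i = 0; i < samples.length; i++) {
    // Clamp to the valid range, then scale to signed 16-bit.
    const s = Math.max(-1, Math.min(1, samples[i]))
    pcm[i] = Math.round(s * 0x7fff)
  }
  // Typed-array buffers are little-endian on all mainstream JS engines.
  return Buffer.from(new Uint8Array(pcm.buffer)).toString('base64')
}
```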
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 90dd861d-bc24-41c4-b3c2-d27c79ea301b
📒 Files selected for processing (7)
examples/ts-react-chat/src/lib/use-realtime.ts
packages/typescript/ai-gemini/src/index.ts
packages/typescript/ai-gemini/src/realtime/adapter.ts
packages/typescript/ai-gemini/src/realtime/index.ts
packages/typescript/ai-gemini/src/realtime/token.ts
packages/typescript/ai-gemini/src/realtime/types.ts
packages/typescript/ai/src/realtime/types.ts
✅ Files skipped from review due to trivial changes (1)
- packages/typescript/ai-gemini/src/realtime/index.ts
🚧 Files skipped from review as they are similar to previous changes (4)
- packages/typescript/ai-gemini/src/index.ts
- examples/ts-react-chat/src/lib/use-realtime.ts
- packages/typescript/ai-gemini/src/realtime/token.ts
- packages/typescript/ai-gemini/src/realtime/types.ts
const {
  languageCode,
  contextWindowCompression,
  proactivity,
  enableAffectiveDialog,
  thinkingConfig
} = config.providerOptions as GeminiRealtimeProviderOptions
Guard optional providerOptions before destructuring.
RealtimeSessionConfig.providerOptions is optional, so this throws on the default path before the socket is even opened.
🐛 Proposed fix
const {
languageCode,
contextWindowCompression,
proactivity,
enableAffectiveDialog,
thinkingConfig
- } = config.providerOptions as GeminiRealtimeProviderOptions
+ } = (config.providerOptions ?? {}) as GeminiRealtimeProviderOptions
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/typescript/ai-gemini/src/realtime/adapter.ts` around lines 107 -
113, The code destructures provider-specific fields from config.providerOptions
without guarding for it; update the destructuring in adapter.ts so it handles an
undefined RealtimeSessionConfig.providerOptions by using a safe fallback (e.g.,
null-coalescing or an early guard) before extracting languageCode,
contextWindowCompression, proactivity, enableAffectiveDialog, and
thinkingConfig; ensure the change references config.providerOptions and
GeminiRealtimeProviderOptions and preserves types while providing sensible
defaults or returning early when providerOptions is absent.
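The guarded destructuring can be exercised in isolation. The interface below is a hypothetical stand-in that mirrors a couple of field names from the diff; it is not the package's real GeminiRealtimeProviderOptions type.

```typescript
// Hypothetical stand-in for GeminiRealtimeProviderOptions; illustrative only.
interface ProviderOptionsSketch {
  languageCode?: string
  enableAffectiveDialog?: boolean
}

// Falls back to an empty object so the default path (no providerOptions set)
// cannot throw before the socket is opened.
function readProviderOptions(config: { providerOptions?: unknown }) {
  const { languageCode, enableAffectiveDialog } =
    (config.providerOptions ?? {}) as ProviderOptionsSketch
  return { languageCode, enableAffectiveDialog }
}
```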
if (response.data) {
  // TODO: Decode chunk and play using an `AudioWorklet` or
  // buffer them into an AudioContext

  playIncomingAudioChunk(textEncoder.encode(response.data).buffer)
Fix audio decoding to read from response.serverContent.modelTurn.parts[].inlineData, not response.data.
The Google Gemini Live API delivers streamed audio as base64-encoded PCM in response.serverContent.modelTurn.parts[].inlineData.data, not in response.data. The current code UTF-8 encodes the base64 string instead of decoding the PCM bytes, breaking audio playback.
🔊 Directional fix
- if (response.data) {
- // TODO: Decode chunk and play using an `AudioWorklet` or
- // buffer them into an AudioContext
-
- playIncomingAudioChunk(textEncoder.encode(response.data).buffer)
-
- if (currentMode !== 'speaking') {
- currentMode = 'speaking'
- emit('mode_change', { mode: 'speaking' })
- }
- }
+ for (const part of response.serverContent?.modelTurn?.parts ?? []) {
+ if (!part.inlineData?.data) continue
+ playIncomingAudioChunk(base64ToArrayBuffer(part.inlineData.data))
+
+ if (currentMode !== 'speaking') {
+ currentMode = 'speaking'
+ emit('mode_change', { mode: 'speaking' })
+ }
+ }

Add a helper to decode base64 to ArrayBuffer:

function base64ToArrayBuffer(base64: string): ArrayBuffer {
  const binaryStr = atob(base64)
  const bytes = new Uint8Array(binaryStr.length)
  for (let i = 0; i < binaryStr.length; i++) {
    bytes[i] = binaryStr.charCodeAt(i)
  }
  return bytes.buffer
}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/typescript/ai-gemini/src/realtime/adapter.ts` around lines 242 -
247, The code is incorrectly encoding response.data as UTF-8; instead extract
the base64 PCM from response.serverContent.modelTurn.parts[].inlineData.data,
decode it to binary and pass the resulting ArrayBuffer to
playIncomingAudioChunk; implement a helper like base64ToArrayBuffer(base64:
string) that uses atob and Uint8Array to produce an ArrayBuffer, and replace the
textEncoder.encode(response.data).buffer call in the adapter (where
playIncomingAudioChunk is invoked) with
base64ToArrayBuffer(response.serverContent.modelTurn.parts[i].inlineData.data)
(ensuring you handle the correct part index and null checks).
if (
  inputTranscription &&
  inputTranscription.text != undefined &&
  inputTranscription.finished != undefined
) {
  if (inputTranscription.finished && currentMode !== 'thinking') {
    currentMode = 'thinking'
    emit('mode_change', { mode: 'thinking' })
  }

  emit('transcript', {
    isFinal: inputTranscription.finished,
    transcript: inputTranscription.text,
    role: 'user',
  })
}

if (
  outputTranscription &&
  outputTranscription.text != undefined &&
  outputTranscription.finished != undefined
) {
  emit('transcript', {
    isFinal: outputTranscription.finished,
    transcript: outputTranscription.text,
    role: 'assistant',
  })
}
Remove the finished field checks—Gemini never sends this field in transcription payloads.
The server's BidiGenerateContentTranscription message contains only a text field. The finished field does not exist in the official Google GenAI API schema, so the guards at lines 254–281 always fail and transcript events never fire.
Use response.serverContent?.turnComplete to signal transcription completion instead, as already done elsewhere in the code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/typescript/ai-gemini/src/realtime/adapter.ts` around lines 254 -
281, The transcription guard incorrectly checks
inputTranscription.finished/outputTranscription.finished (Gemini never sends
`finished`), causing transcript events to be skipped; update the checks to only
verify presence of text (inputTranscription.text/outputTranscription.text) and
use response.serverContent?.turnComplete to determine finality (pass that value
as isFinal), and preserve the currentMode switch to 'thinking' and
emit('mode_change') when appropriate; locate this logic around
inputTranscription/outputTranscription and emit('transcript') to apply the
change.
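The suggested rework can be sketched as a pure function that derives finality from turnComplete instead of the non-existent finished field. The payload shapes here are simplified stand-ins; the real types come from @google/genai.

```typescript
// Simplified stand-ins for the Live API payload shapes; illustrative only.
interface Transcription { text?: string }
interface ServerContent {
  turnComplete?: boolean
  inputTranscription?: Transcription
  outputTranscription?: Transcription
}

type TranscriptEvent = { role: 'user' | 'assistant'; transcript: string; isFinal: boolean }

// Emit a transcript whenever text is present; finality comes from turnComplete.
function extractTranscripts(content: ServerContent): Array<TranscriptEvent> {
  const isFinal = content.turnComplete === true
  const events: Array<TranscriptEvent> = []
  if (content.inputTranscription?.text !== undefined) {
    events.push({ role: 'user', transcript: content.inputTranscription.text, isFinal })
  }
  if (content.outputTranscription?.text !== undefined) {
    events.push({ role: 'assistant', transcript: content.outputTranscription.text, isFinal })
  }
  return events
}
```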
sendToolResult(callId: string, result: string) {
  session.sendToolResponse({
    functionResponses: {
      id: callId,
      response: {
        result
      }
    }
  })
Wrap functionResponses in an array.
The Gemini Live API expects functionResponses as an array of FunctionResponse objects, not a single object. The current code sends the wrong payload structure, which will fail server-side tool result matching.
🛠️ Proposed fix
sendToolResult(callId: string, result: string) {
session.sendToolResponse({
- functionResponses: {
- id: callId,
- response: {
- result
- }
- }
+ functionResponses: [
+ {
+ id: callId,
+ response: {
+ result,
+ },
+ },
+ ],
})
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/typescript/ai-gemini/src/realtime/adapter.ts` around lines 512 -
520, The sendToolResult implementation in sendToolResult is sending
functionResponses as a single object but the Gemini Live API expects an array of
FunctionResponse objects; update the call to session.sendToolResponse so that
the functionResponses property is an array (e.g., [ { id: callId, response: {
result } } ]) and adjust typing if necessary to match the FunctionResponse[]
shape used elsewhere.
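The corrected payload shape from the proposed fix can be captured in a small builder for reference. This is a sketch; the actual adapter passes the object straight to session.sendToolResponse.

```typescript
// Builds the tool-response payload with functionResponses as an array,
// matching the FunctionResponse[] shape the Live API expects.
function buildToolResponse(callId: string, result: string) {
  return {
    functionResponses: [
      {
        id: callId,
        response: { result },
      },
    ],
  }
}
```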

🎯 Changes
✅ Checklist
pnpm run test:pr.
🚀 Release Impact
Summary by CodeRabbit