Feature/multimodal observations#388

Open
fatemehpesaran310 wants to merge 7 commits into ServiceNow:main from fatemehpesaran310:feature/multimodal-observations

Conversation

@fatemehpesaran310

Add audio observation in browsergym.

Fatemeh Pesaran Zadeh [Affiliate] and others added 7 commits April 14, 2026 00:19
New observation fields when enable_audio=True:
- audio_segment: raw WAV bytes (for omni models that accept audio)
- audio_transcript: Whisper transcription (for LLM agents)

Uses a PulseAudio virtual sink to capture browser audio at the OS level,
bypassing DRM. Requires pactl, parec, and ffmpeg on Linux.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the PulseAudio-based approach with pure in-browser capture:
- Uses captureStream() + MediaRecorder on <audio>/<video> elements
- No system dependencies (no PulseAudio, no ffmpeg)
- Works cross-platform (Linux, macOS, Windows)
- Works with self-hosted websites (no DRM)

Two output modes:
- audio_segment: raw audio bytes (webm/opus) for omni models
- audio_transcript: Whisper transcription for LLM agents

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- audio.py: transcribe_audio() now defaults to OpenAI Whisper API,
  with use_api=False for local model fallback
- test_audio_capture.py: standalone test for audio capture pipeline
- test_llm_agent.py: multi-step LLM agent test with cross-modality
  tasks (audio comprehension + browser action), uses WebArena-style
  bid-based actions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
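The API-default-with-local-fallback switch described above can be sketched as a simple dispatcher. This is not the PR's implementation: the real `transcribe_audio` presumably calls the OpenAI Whisper API or a local Whisper model directly, which are stubbed here as injectable callables so the control flow stays runnable.

```python
from typing import Callable, Optional

def transcribe_audio(
    audio_bytes: bytes,
    use_api: bool = True,
    api_backend: Optional[Callable[[bytes], str]] = None,
    local_backend: Optional[Callable[[bytes], str]] = None,
) -> str:
    """Route audio to the API backend (default) or a local-model backend.

    Sketch only: backends are injected callables standing in for the
    OpenAI Whisper API call and the local Whisper model in the PR.
    """
    backend = api_backend if use_api else local_backend
    if backend is None:
        raise ValueError("no backend available for the requested mode")
    return backend(audio_bytes)
```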
Includes 15 tasks (0-9 text-only, 10-14 cross-modality):
- 10-11: Audio comprehension (listen to voice messages, answer questions)
- 12-13: Audio + browser action (listen, navigate, post based on audio)
- 14: Video comprehension (identify text shown in video)

Tasks 10-14 require enable_audio=True and Whisper transcription.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New observation field when enable_video=True:
- video_frames: list of {timestamp, base64, image} dicts extracted
  from <video> elements using JavaScript Canvas API

Extracts N evenly spaced frames at 720p resolution. Defaults to 10 frames
(~11k tokens for VLMs). No system dependencies required.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
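The "N evenly spaced frames" sampling above boils down to picking N timestamps across the clip's duration. A sketch of that arithmetic, under one common interpretation (interior points, skipping the very first and last instants); the function name is illustrative, not the PR's actual helper:

```python
def frame_timestamps(duration: float, n_frames: int = 10) -> list[float]:
    """Return n_frames evenly spaced capture points (seconds) within a clip.

    Illustrative helper: splits the clip into n_frames + 1 intervals and
    samples at each interior boundary, avoiding t=0 and t=duration.
    """
    if n_frames < 1 or duration <= 0:
        return []
    step = duration / (n_frames + 1)
    return [round(step * (i + 1), 3) for i in range(n_frames)]
```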
Two paths for video understanding:
- VLM agent: receives raw frames directly as images
- LLM agent: frames are captioned by GPT-4o, descriptions passed as text

New functions:
- describe_video_frames(): sends frames to GPT-4o, returns text descriptions
- format_frame_descriptions(): formats descriptions for agent observations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
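A hedged sketch of what `format_frame_descriptions()` might produce for the LLM path above. The function name comes from the commit message; the input dict keys and the exact text layout are assumptions for illustration.

```python
def format_frame_descriptions(frames: list[dict]) -> str:
    """Join per-frame GPT-4o captions into one text block for the agent prompt.

    Assumes each entry has a `timestamp` (seconds) and a `description`
    string -- the real field names in the PR may differ.
    """
    lines = ["Video frame descriptions:"]
    for frame in frames:
        lines.append(f"[t={frame['timestamp']:.1f}s] {frame['description']}")
    return "\n".join(lines)
```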