Feature/multimodal observations#388

Open
fatemehpesaran310 wants to merge 7 commits into ServiceNow:main from fatemehpesaran310:feature/multimodal-observations

Conversation

@fatemehpesaran310

Add audio observation in browsergym.

Fatemeh Pesaran Zadeh [Affiliate] and others added 7 commits April 14, 2026 00:19
New observation fields when enable_audio=True:
- audio_segment: raw WAV bytes (for omni models that accept audio)
- audio_transcript: Whisper transcription (for LLM agents)

Uses a PulseAudio virtual sink to capture browser audio at the OS level,
bypassing DRM. Requires pactl, parec, and ffmpeg on Linux.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the PulseAudio-based approach with pure in-browser capture:
- Uses captureStream() + MediaRecorder on <audio>/<video> elements
- No system dependencies (no PulseAudio, no ffmpeg)
- Works cross-platform (Linux, macOS, Windows)
- Works with self-hosted websites (no DRM)

Two output modes:
- audio_segment: raw audio bytes (webm/opus) for omni models
- audio_transcript: Whisper transcription for LLM agents

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- audio.py: transcribe_audio() now defaults to OpenAI Whisper API,
  with use_api=False for local model fallback
- test_audio_capture.py: standalone test for audio capture pipeline
- test_llm_agent.py: multi-step LLM agent test with cross-modality
  tasks (audio comprehension + browser action), uses WebArena-style
  bid-based actions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
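The API-default-with-local-fallback switch described above can be sketched as a simple dispatcher. This is not the PR's implementation: the real `transcribe_audio` presumably calls the OpenAI Whisper API or a local Whisper model directly, which are stubbed here as injectable callables so the control flow stays runnable.

```python
from typing import Callable, Optional

def transcribe_audio(
    audio_bytes: bytes,
    use_api: bool = True,
    api_backend: Optional[Callable[[bytes], str]] = None,
    local_backend: Optional[Callable[[bytes], str]] = None,
) -> str:
    """Route audio to the API backend (default) or a local-model backend.

    Sketch only: backends are injected callables standing in for the
    OpenAI Whisper API call and the local Whisper model in the PR.
    """
    backend = api_backend if use_api else local_backend
    if backend is None:
        raise ValueError("no backend available for the requested mode")
    return backend(audio_bytes)
```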
Includes 15 tasks (0-9 text-only, 10-14 cross-modality):
- 10-11: Audio comprehension (listen to voice messages, answer questions)
- 12-13: Audio + browser action (listen, navigate, post based on audio)
- 14: Video comprehension (identify text shown in video)

Tasks 10-14 require enable_audio=True and Whisper transcription.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New observation field when enable_video=True:
- video_frames: list of {timestamp, base64, image} dicts extracted
  from <video> elements using JavaScript Canvas API

Extracts N evenly spaced frames at 720p resolution. Defaults to 10 frames
(~11k tokens for VLMs). No system dependencies required.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
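The "N evenly spaced frames" sampling above boils down to picking N timestamps across the clip's duration. A sketch of that arithmetic, under one common interpretation (interior points, skipping the very first and last instants); the function name is illustrative, not the PR's actual helper:

```python
def frame_timestamps(duration: float, n_frames: int = 10) -> list[float]:
    """Return n_frames evenly spaced capture points (seconds) within a clip.

    Illustrative helper: splits the clip into n_frames + 1 intervals and
    samples at each interior boundary, avoiding t=0 and t=duration.
    """
    if n_frames < 1 or duration <= 0:
        return []
    step = duration / (n_frames + 1)
    return [round(step * (i + 1), 3) for i in range(n_frames)]
```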
Two paths for video understanding:
- VLM agent: receives raw frames directly as images
- LLM agent: frames are captioned by GPT-4o, descriptions passed as text

New functions:
- describe_video_frames(): sends frames to GPT-4o, returns text descriptions
- format_frame_descriptions(): formats descriptions for agent observations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
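A hedged sketch of what `format_frame_descriptions()` might produce for the LLM path above. The function name comes from the commit message; the input dict keys and the exact text layout are assumptions for illustration.

```python
def format_frame_descriptions(frames: list[dict]) -> str:
    """Join per-frame GPT-4o captions into one text block for the agent prompt.

    Assumes each entry has a `timestamp` (seconds) and a `description`
    string -- the real field names in the PR may differ.
    """
    lines = ["Video frame descriptions:"]
    for frame in frames:
        lines.append(f"[t={frame['timestamp']:.1f}s] {frame['description']}")
    return "\n".join(lines)
```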