Feature/multimodal observations #388
Open
fatemehpesaran310 wants to merge 7 commits into ServiceNow:main from
Conversation
New observation fields when enable_audio=True:
- audio_segment: raw WAV bytes (for omni models that accept audio)
- audio_transcript: Whisper transcription (for LLM agents)

Uses PulseAudio virtual sink to capture browser audio at the OS level, bypassing DRM. Requires pactl, parec, and ffmpeg on Linux.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
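The OS-level pipeline this commit describes could be sketched roughly as below; the sink name, helper names, and exact flags are illustrative assumptions, not the PR's actual code.

```python
import subprocess

SINK = "browsergym_capture"  # hypothetical sink name, for illustration only

def setup_virtual_sink():
    """Create a PulseAudio null sink so browser audio can be tapped at the OS level."""
    subprocess.run(
        ["pactl", "load-module", "module-null-sink", f"sink_name={SINK}"],
        check=True,
    )

def build_record_cmd(seconds: float, out_wav: str) -> list:
    """parec reads raw PCM from the sink's monitor; ffmpeg wraps it into a WAV file."""
    return [
        "sh", "-c",
        f"parec -d {SINK}.monitor --format=s16le --rate=44100 --channels=2 "
        f"| ffmpeg -y -f s16le -ar 44100 -ac 2 -t {seconds} -i - {out_wav}",
    ]
```

Routing the browser's output to the null sink (e.g. via pactl move-sink-input) is what sidesteps DRM: the capture happens after decoding, at the audio-server level.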
Replaces PulseAudio-based approach with pure in-browser capture:
- Uses captureStream() + MediaRecorder on <audio>/<video> elements
- No system dependencies (no PulseAudio, no ffmpeg)
- Works cross-platform (Linux, Mac, Windows)
- Works with self-hosted websites (no DRM)

Two output modes:
- audio_segment: raw audio bytes (webm/opus) for omni models
- audio_transcript: Whisper transcription for LLM agents

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
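A minimal sketch of the in-browser approach, assuming the JS is injected via Playwright's page.evaluate(); the element selection, recording window, and byte-marshalling details are guesses at the technique named above, not the PR's exact code.

```python
# JS run inside the page: tap the first <audio>/<video> element's stream
# with captureStream() and record it with MediaRecorder (webm/opus).
CAPTURE_JS = """
async (ms) => {
  const el = document.querySelector('audio, video');
  if (!el) return null;
  const stream = el.captureStream();
  const rec = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' });
  const chunks = [];
  rec.ondataavailable = (e) => chunks.push(e.data);
  rec.start();
  await new Promise((r) => setTimeout(r, ms));  // record for `ms` milliseconds
  rec.stop();
  await new Promise((r) => (rec.onstop = r));
  const buf = await new Blob(chunks).arrayBuffer();
  return Array.from(new Uint8Array(buf));       // bytes back to Python
};
"""

def audio_segment_bytes(page, ms=3000):
    """Run the recorder in the page; return webm/opus bytes, or None if no media element."""
    data = page.evaluate(CAPTURE_JS, ms)
    return bytes(data) if data is not None else None
```

Because the recording happens inside the renderer, nothing outside the browser is needed, which is what makes this cross-platform.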
- audio.py: transcribe_audio() now defaults to OpenAI Whisper API, with use_api=False for local model fallback
- test_audio_capture.py: standalone test for audio capture pipeline
- test_llm_agent.py: multi-step LLM agent test with cross-modality tasks (audio comprehension + browser action), uses WebArena-style bid-based actions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
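The use_api dispatch might look like the sketch below; api_fn/local_fn are injection hooks added here for illustration, while the real helper presumably calls OpenAI's transcriptions endpoint and the openai-whisper package directly.

```python
def transcribe_audio(wav_path, use_api=True, api_fn=None, local_fn=None):
    """Transcribe a WAV file, defaulting to the OpenAI Whisper API.

    Pass use_api=False to fall back to a local whisper model.
    api_fn/local_fn exist only so this sketch is testable without network access.
    """
    if use_api:
        if api_fn is not None:
            return api_fn(wav_path)
        from openai import OpenAI  # hosted Whisper via the OpenAI client
        client = OpenAI()
        with open(wav_path, "rb") as f:
            return client.audio.transcriptions.create(model="whisper-1", file=f).text
    if local_fn is not None:
        return local_fn(wav_path)
    import whisper  # local fallback: the openai-whisper package
    return whisper.load_model("base").transcribe(wav_path)["text"]
```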
Includes 15 tasks (0-9 text-only, 10-14 cross-modality):
- 10-11: Audio comprehension (listen to voice messages, answer questions)
- 12-13: Audio + browser action (listen, navigate, post based on audio)
- 14: Video comprehension (identify text shown in video)

Tasks 10-14 require enable_audio=True and Whisper transcription.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
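The task split above could be expressed as a small registry; the dict shape and field names here are hypothetical, only the id ranges and modality assignments come from the commit message.

```python
# Hypothetical task registry mirroring the commit's split:
# 0-9 text-only, 10-11 audio comprehension, 12-13 audio + browser
# action, 14 video comprehension.
TASKS = (
    [{"id": i, "modality": "text"} for i in range(10)]
    + [{"id": i, "modality": "audio"} for i in (10, 11)]
    + [{"id": i, "modality": "audio+action"} for i in (12, 13)]
    + [{"id": 14, "modality": "video"}]
)

def requires_audio(task):
    # Every non-text task relies on captured media plus Whisper transcription,
    # so it needs the env created with enable_audio=True.
    return task["modality"] != "text"

audio_task_ids = [t["id"] for t in TASKS if requires_audio(t)]
```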
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New observation field when enable_video=True:
- video_frames: list of {timestamp, base64, image} dicts extracted
from <video> elements using JavaScript Canvas API
Extracts N evenly-spaced frames at 720p resolution. Default 10 frames
(~11k tokens for VLMs). No system dependencies required.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
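Frame sampling of the kind described above (N evenly spaced frames, capped at 720p) can be sketched with two pure helpers; midpoint placement and the exact scaling rule are assumptions, not necessarily what the PR's JS does.

```python
def frame_times(duration: float, n: int = 10) -> list:
    """N evenly spaced sample timestamps, placed at segment midpoints so the
    first/last samples avoid the (often black) very start and end of the video."""
    return [duration * (i + 0.5) / n for i in range(n)]

def scale_to_720p(width: int, height: int) -> tuple:
    """Target size for a canvas draw: at most 720 px tall, aspect ratio preserved."""
    if height <= 720:
        return width, height
    return round(width * 720 / height), 720
```

On the Python side, each extracted frame then becomes one {timestamp, base64, image} dict in the video_frames observation list.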
Two paths for video understanding:
- VLM agent: receives raw frames directly as images
- LLM agent: frames are captioned by GPT-4o, descriptions passed as text

New functions:
- describe_video_frames(): sends frames to GPT-4o, returns text descriptions
- format_frame_descriptions(): formats descriptions for agent observations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
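The LLM-agent path ends with plain text in the observation; a format_frame_descriptions() along these lines would do, though the input shape and output layout are illustrative assumptions (describe_video_frames() is omitted since it is just a captioning call to GPT-4o).

```python
def format_frame_descriptions(descriptions: list) -> str:
    """Flatten (timestamp_seconds, caption) pairs into a text block an LLM
    agent can read in place of raw frames."""
    lines = ["Video frames (captioned):"]
    for ts, caption in descriptions:
        lines.append(f"  [{ts:.1f}s] {caption}")
    return "\n".join(lines)
```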
Add audio observation in browsergym.