Saotri Bench — coding benchmark for evaluating LLM agents on multi-phase programming tasks with hidden requirements.
Updated Feb 21, 2026 · Python
Raw logs of Claude Code running against a local Qwen3.5-27B model (served via llama.cpp), building a Python todo app with 50 tests. Includes real-world performance data: ~30 minutes of wall-clock time, cache thrashing, and 38 tokens/s generation.
🔍 Evaluate LLM agents on multi-phase programming tasks with FluxCodeBench, focusing on hidden requirements, long-context retention, and iterative refinement.