Saotri Bench — coding benchmark for evaluating LLM agents on multi-phase programming tasks with hidden requirements.
Updated Feb 21, 2026 · Python
Raw logs of Claude Code running against a local Qwen3.5-27B model (served via llama.cpp), building a Python todo app with 50 tests. Includes real-world performance data: ~30 minutes of wall-clock time, cache thrashing, and 38 tokens/s generation.
🔍 Evaluate LLM agents on multi-phase programming tasks with FluxCodeBench, focusing on hidden requirements, long-context retention, and iterative refinement.