Problem Description
When auditing a workflow run with a failed detection job (where the squid proxy container crashed), the ToolCalls metrics in run_summary.json contain 126 entries with 68 unique tool names — but the agent only executed ~11 descriptive shell commands and 1 MCP noop call. The metrics include:
- 54 phantom GitHub/safeoutputs tool calls that the agent never actually invoked
- 22 tools duplicated under two naming conventions: github-get_me (dash) and github___get_me (triple-underscore) both appear independently
- noop listed multiple times with different counts (totaling 7 across entries instead of 1)
- Shell command labels treated as tool names: descriptive names like Check, Extract, Inspect, Load, Save, Select, Update, Fetching appear because they are labels Claude Code assigns to shell calls
The audit MCP tool surfaces this corrupted data, showing tool_types=68 in observability insights (implying 68 distinct tool types were used) when in reality only ~2 were used (shell + MCP safeoutputs).
Reproduction
- Run: §24299193223, GPL Dependency Cleaner (gpclean), April 12 2026
- Root cause event: Detection job failed because the awf-squid container exited with code 1 mid-run
- Tool: audit MCP tool with run_id_or_url: 24299193223
Steps to Reproduce
- Identify a run where the detection job failed due to a container crash (e.g., squid proxy exit)
- Call the audit tool on that run ID
- Inspect the tool_usage array in the response or run_summary.json → metrics.ToolCalls
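The inspection step can be scripted. The following is a minimal sketch, assuming each ToolCalls entry is an object with a Name field and a CallCount field; those field names are assumptions for illustration and may not match the real run_summary.json schema. It reports the number of entries, the number of unique names, and which names are listed more than once.

```python
import json
import sys
from collections import Counter


def load_tool_calls(path):
    """Read metrics.ToolCalls from a run_summary.json file (may be None)."""
    with open(path, encoding="utf-8") as fh:
        summary = json.load(fh)
    return (summary.get("metrics") or {}).get("ToolCalls")


def summarize_tool_calls(tool_calls):
    """Return (entry count, unique name count, names listed more than once).

    Assumes each entry looks like {"Name": ..., "CallCount": ...};
    these field names are hypothetical.
    """
    names = [entry["Name"] for entry in tool_calls or []]
    per_name = Counter(names)
    repeated = sorted(name for name, n in per_name.items() if n > 1)
    return len(names), len(per_name), repeated


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "run_summary.json"
    entries, unique, repeated = summarize_tool_calls(load_tool_calls(path))
    print(f"entries={entries} unique_names={unique}")
    print("listed more than once:", ", ".join(repeated) or "(none)")
```

For the run above, a correct summary would show roughly 12 entries; the reported symptom is 126 entries and 68 unique names. A ToolCalls value of null is treated as an empty list.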
Expected Behavior
ToolCalls should list only tools actually invoked by the agent:
- Shell tool calls (with their descriptive labels, e.g., Load cache-memory state, Check transitive deps)
- MCP tools (e.g., noop)
- Total: ~12 entries for this run
Actual Behavior
126 entries in ToolCalls, including:
github-list_branches: 3 github___list_branches: 2 ← same tool, two formats
github-get_me: 3 github___get_me: 2 ← same tool, two formats
safeoutputs-create_issue: 3 safeoutputs___create_issue: 2
noop: 1, noop: 1, noop: 4, noop: 1 ← duplicate noop entries
Check: 2, Inspect: 2, Extract: 2, Load: 1, Save: 1, Select: 1, Update: 1, Fetching: 1
tool: 2, tool: 1 ← word "tool" as a tool name
All 22 GitHub API tools and all 5 safeoutputs tools appear in both the dash and the triple-underscore format (github-xxx and github___xxx, safeoutputs-xxx and safeoutputs___xxx), each counted 2-3 times. These tools were never called; the agent only ran shell commands and one noop.
Hypothesis
The tool call parser scans a log stream that includes both the system prompt (which enumerates all available tools) and the actual agent output. When the detection job crashes partway through, the partial data stream causes:
- Every available tool to be counted as "called" (from the tool list in the prompt context)
- The same tool appearing in two formats depending on which log section was scanned
- Descriptive shell command labels being treated as tool names
Comparison: successful runs from the same session (24381951145, 24381981463) have ToolCalls: null in run_summary.json and show correct tool_types=0 in the audit output.
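If the hypothesis holds, a parser that counts only explicit invocation events would not pick up names from the tool list. The sketch below illustrates that idea on a simplified, hypothetical log format: one JSON object per line, where a "tool_use" event carries the tool name and a "system" event carries the list of available tools. The event types and field names are assumptions and are not taken from the actual parser.

```python
import json
from collections import Counter


def count_invocations(log_lines):
    """Count only explicit tool_use events.

    Tool listings (hypothetical "system" events) and lines that fail to
    parse, such as a line cut off when the container exits, are ignored.
    """
    counts = Counter()
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or non-JSON line: skip it rather than guess
        if isinstance(event, dict) and event.get("type") == "tool_use":
            name = event.get("name")
            if name:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    log = [
        '{"type": "system", "tools": ["github-get_me", "noop"]}',
        '{"type": "tool_use", "name": "noop"}',
        '{"type": "tool_use", "na',  # line truncated by the crash
    ]
    print(dict(count_invocations(log)))
```

In this sketch the tools named in the system event never reach the counter, and the truncated final line is dropped instead of being partially matched.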
Impact
- Severity: Medium — misleading observability data
- Frequency: Reproducible when detection job fails due to container crash
- User impact: audit reports tool_types=68 for a simple read-only run, causing incorrect optimization recommendations and inflated resource profiling
Environment
- 0eb9816
Suggested Fix
- Deduplicate ToolCalls entries before writing to run_summary.json, merging entries with the same base name (normalizing github___xxx and github-xxx to a canonical form)
- Add a filter to exclude tool names that appear in the system prompt tool list but have no corresponding invocation event
- Skip ToolCalls population when the detection job exits with a container failure code
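The three suggestions above could be combined roughly as follows. This is a sketch under stated assumptions, not the project's implementation: the Name and CallCount field names, the choice of the dash form as canonical, the server prefixes, and all function names are hypothetical.

```python
from collections import Counter

# Assumed MCP server prefixes; taken from the names seen in this report.
SERVERS = ("github", "safeoutputs")


def canonical_name(name):
    """Map github___get_me and github-get_me to one form (dash chosen here)."""
    for server in SERVERS:
        for sep in ("___", "-"):
            prefix = server + sep
            if name.startswith(prefix):
                return f"{server}-{name[len(prefix):]}"
    return name


def merge_tool_calls(entries):
    """Merge entries that share a canonical name, summing their counts."""
    merged = Counter()
    for entry in entries:
        merged[canonical_name(entry["Name"])] += entry["CallCount"]
    return dict(merged)


def build_tool_calls(entries, invoked_names, container_failed):
    """Return the ToolCalls mapping to write, or None to skip population.

    invoked_names: names that have a real invocation event in the log.
    container_failed: True when the detection job exited with a container
    failure code.
    """
    if container_failed:
        return None
    invoked = {canonical_name(n) for n in invoked_names}
    merged = merge_tool_calls(entries)
    return {name: count for name, count in merged.items() if name in invoked}
```

Summing counts is one possible merge policy. If both naming formats turn out to record the same underlying event, summing would double-count, and taking the maximum may be closer to the truth.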
Generated by Daily CLI Tools Exploratory Tester