results
The benchmark changed the recommendations.
Every named recommendation below now has three fresh baseline/optimized pairs on the same noisy owner-breakdown task. The plugin keeps the practices that reduced the token category they target, and it downgrades tools that added cost or context.
replication status
All tool verdicts on this page are 3x repeated results.
Each suite ran three fresh baseline/optimized pairs from commit b96b8a7 with the same prompt and the same go test ./... quality gate. All published medium-context aggregates passed the quality gate in all three repeats. Single-run comparison JSONs remain available as historical artifacts, but the verdicts below use the aggregate-*.json files. The fuller sanitized recordings are committed under docs/benchmarks/primary-data/.
cost scale
The useful number is 24.0%, not four cents.
The Agent Analyzer suite went from $0.2468368 to $0.1876295 at published Claude Sonnet 4.6 API rates, a 23.986% reduction. On comparable recurring usage, that is about $479.73/month on $2,000/month, $1,199.32/month on $5,000/month, or $2,398.64/month on $10,000/month.
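The arithmetic behind those figures can be checked with a short sketch. The two cost values come from the paragraph above; the function names are illustrative and not part of any published tool.

```python
# Figures copied from the paragraph above (USD per suite at published API rates).
BASELINE_COST = 0.2468368
OPTIMIZED_COST = 0.1876295


def reduction_fraction(baseline: float, optimized: float) -> float:
    """Fraction of the baseline cost that the optimized run removed."""
    return (baseline - optimized) / baseline


def projected_monthly_savings(monthly_spend: float, fraction: float) -> float:
    """Scale the fractional reduction to a monthly budget, rounded to cents."""
    return round(monthly_spend * fraction, 2)


fraction = reduction_fraction(BASELINE_COST, OPTIMIZED_COST)
print(f"reduction: {fraction * 100:.3f}%")
for spend in (2_000, 5_000, 10_000):
    saved = projected_monthly_savings(spend, fraction)
    print(f"${spend:,}/month -> ${saved:,.2f}/month saved")
```

The projection uses the unrounded fraction, not the rounded 24.0% headline, which is why the monthly figures land a few cents away from a flat 24% of each budget. It also assumes that recurring usage resembles the benchmark task.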
token category rule
Output, reasoning, tool-output, and cost are separate claims.
Tool-output and retrieval tools target input/context tokens. Terse-response tools target visible output tokens. Codex exposes reasoning tokens separately. The published API-rate cost reprices only the exposed token categories, so Claude Code's native cost can differ: the product may use internal routing and model work.
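A minimal sketch of why these stay separate claims: each category is tracked on its own and then repriced with a per-category rate. Everything here is hypothetical, including the field names, the token counts, and the rates, which are placeholders and not published prices.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Exposed token categories for one run (hypothetical field names)."""
    input_context: int
    visible_output: int
    reasoning: int = 0


# HYPOTHETICAL rates in USD per million tokens, for illustration only.
EXAMPLE_RATES = {"input_context": 1.00, "visible_output": 5.00, "reasoning": 5.00}


def category_deltas(baseline: TokenUsage, optimized: TokenUsage) -> dict:
    """Per-category change; negative means the optimized run used fewer tokens."""
    base, opt = asdict(baseline), asdict(optimized)
    return {name: opt[name] - base[name] for name in base}


def reprice(usage: TokenUsage, rates: dict) -> float:
    """Cost in USD from exposed categories only, at the given per-million rates."""
    return sum(count * rates[name] / 1_000_000 for name, count in asdict(usage).items())


# Made-up runs: the tool cuts context tokens but slightly raises visible output.
baseline = TokenUsage(input_context=100_000, visible_output=2_000, reasoning=500)
optimized = TokenUsage(input_context=88_000, visible_output=2_100, reasoning=500)

deltas = category_deltas(baseline, optimized)
print(deltas)
print(reprice(baseline, EXAMPLE_RATES), reprice(optimized, EXAMPLE_RATES))
```

In this made-up pair the context claim holds and the output claim does not, yet the repriced cost still falls. A repriced cost of this kind covers only the exposed categories, so it need not match a harness's native cost figure.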
Repeated Recommendation Matrix
| Recommendation | Target category | Quality | Estimated tokens | Tool output | Output / reasoning | Published API delta | API-rate percent | Verdict |
|---|---|---|---|---|---|---|---|---|
| Agent Analyzer guidance | Measured workflow and tool-output hygiene | 3/3 | -12,370 | -12,698 | Claude output -504 | -$0.059207 | -24.0% | Positive |
| claude-context | Input/context retrieval through MCP | 3/3 | +7,327 | +4,170 | Claude output +1,169 | +$0.058038 | +26.0% | Negative here |
| claude-rlm | Recursive sub-agent decomposition | 3/3 | +19,477 | +6,020 | Claude root output -1,197 | root-session only; sub-agent cost not exposed | n/a | Negative here |
| context-mode | Tool-output/input-context batching | 3/3 | -12,359 | -13,257 | Claude output +170 | -$0.052175 | -20.4% | Conditional |
| grepai | Path-constrained compact retrieval | 3/3 | -14,567 | -15,571 | Claude output +443 | -$0.037657 | -14.5% | Conditional |
| claude-token-efficient | Visible-output verbosity guidance | 3/3 | -391 | -754 | Claude output -79 | -$0.004208 | -1.8% | Too small |
| RTK | Explicit shell-output compression | 3/3 | -12,446 | -12,716 | Claude output +114 | -$0.044316 | -18.2% | Conditional |
| Probe | Bounded code search | 3/3 | +874 | -745 | Claude output +548 | +$0.038340 | +16.6% | Negative here |
| Semble | Path-limited semantic retrieval | 3/3 | -16,301 | -16,060 | Claude output -480 | -$0.114194 | -41.5% | Positive here |
| Squeez | Explicit shell-output compression | 3/3 | -8,471 | -8,917 | Claude output +73 | -$0.028224 | -12.1% | Removed: Spec Kitty conflict |
| ccusage / ccstatusline | Telemetry only | n/a | n/a | n/a | n/a | n/a | n/a | Visibility |
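
One way to read the verdict column is sketched below. This is an inferred reading of the matrix, not the project's stated rule: the function, its rule order, and the 5% "too small" cutoff are all assumptions. It does not model non-numeric verdicts such as the Squeez removal or the telemetry row.

```python
from typing import Optional


def classify(estimated_delta: int, output_delta: int,
             cost_pct: Optional[float], min_effect_pct: float = 5.0) -> str:
    """Hypothetical verdict rule inferred from the matrix rows.

    estimated_delta: change in estimated tokens (negative is a reduction)
    output_delta:    change in visible output tokens
    cost_pct:        API-rate percent change, or None when cost is not exposed
    min_effect_pct:  ASSUMED cutoff below which an effect counts as too small
    """
    if cost_pct is None:
        # Without a cost figure, fall back to the estimated-token direction.
        return "negative" if estimated_delta > 0 else "unknown"
    if cost_pct > 0:
        return "negative"
    if abs(cost_pct) < min_effect_pct:
        return "too small"
    if output_delta > 0:
        # Cost fell, but an untargeted category grew.
        return "conditional"
    return "positive"


# Rows copied from the matrix: (estimated tokens, Claude output, API-rate percent).
rows = {
    "Agent Analyzer guidance": (-12_370, -504, -24.0),
    "claude-rlm": (19_477, -1_197, None),
    "context-mode": (-12_359, 170, -20.4),
    "claude-token-efficient": (-391, -79, -1.8),
    "Probe": (874, 548, 16.6),
    "Semble": (-16_301, -480, -41.5),
}
verdicts = {name: classify(*values) for name, values in rows.items()}
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

Under these assumptions, the sketch matches the published verdict for each of the six rows listed.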
Codex And Caveman Controls
| Harness | Intervention | Quality | Analyzer estimated | Tool output | Native token signal | Published API delta | Verdict |
|---|---|---|---|---|---|---|---|
| Claude Code -p | Agent Analyzer plugin | 3/3 | -12,370 | -12,698 | native cost -$0.044219 | -$0.059207 | Positive |
| Codex exec --json | Agent Analyzer text guidance | 3/3 | -14,520 | -14,527 | uncached+output -24,369; reasoning -45 | -$0.062392 | Positive here |
| Claude Code -p | Caveman terse-output pressure | 3/3 | +4,355 | +4,868 | native cost +$0.009919; output -370 | +$0.009211 | Negative here |
| Codex exec --json | Caveman terse-output pressure | 3/3 | -9,210 | -9,109 | uncached+output -4,739; reasoning -2 | -$0.033986 | Harness-specific |
Product Conclusions
- Keep Agent Analyzer as the core: it produced repeated reductions in estimated tokens, tool output, visible output, native Claude cost, and published-rate estimated cost.
- Narrow the paid pack: default guidance now includes Agent Analyzer workflow, output-budgeted commands, retrieval hygiene, session hygiene, and retry breaking.
- Recommend only working reducers by category: Semble, context-mode, RTK, and grepai remain conditional recommendations tied to matching findings.
- Do not recommend telemetry as a reducer: ccusage and ccstatusline help users see spend and context pressure, but they do not directly reduce input, output, tool-output, or reasoning tokens.
- Remove negative, conflicting, or too-small tools from default advice: claude-context, Probe, Caveman for Claude, claude-rlm, claude-token-efficient, and Squeez no longer ship as default token-saving recommendations.
- Do not use RLM as proof for this fixture: claude-rlm is designed for truly long contexts, and on this medium-context task its sub-agent decomposition added analyzer-estimated tokens, tool output, and failed commands.
- Keep negative and mixed evidence visible: users can audit what failed, but the download path now points only at what worked.
Artifacts
- Machine-readable benchmark summary
- Agent Analyzer 3x aggregate
- Codex guidance 3x aggregate
- claude-context 3x aggregate
- claude-rlm 3x aggregate
- context-mode 3x aggregate
- grepai 3x aggregate
- RTK 3x aggregate
- Probe 3x aggregate
- Semble 3x aggregate
- Squeez 3x aggregate
- claude-token-efficient 3x aggregate
- Caveman Claude 3x aggregate
- Caveman Codex 3x aggregate
- Repeated benchmark suite policy and commands
- Published API cost translation note
- Primary sanitized benchmark recordings in git
run locally
Generate your own report and plugin.
npx --yes agent-analyzer@latest run
Runs on your machine first, asks before upload, and turns the sanitized report into a targeted plugin for your detected waste.