The benchmark changed the recommendations.

Every named recommendation below now has three fresh baseline/optimized pairs on the same noisy owner-breakdown task. The plugin keeps the practices that reduced the token category they target, and it downgrades tools that added cost or context.

replication status

Every tool verdict on this page rests on three repeated runs.

Each suite ran three fresh baseline/optimized pairs from commit b96b8a7, using the same prompt and the same `go test ./...` quality gate. All published medium-context aggregates passed the quality gate in all three repeats. Single-run comparison JSONs remain available as historical artifacts, but the verdicts below use aggregate-*.json. The fuller sanitized recordings are committed under docs/benchmarks/primary-data/.
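A minimal sketch of how three baseline/optimized pairs might be folded into one aggregate. It assumes the aggregate is a mean across repeats. The field names (`baseline_cost`, `optimized_cost`, `quality_passed`) and the sample numbers are hypothetical and are not the schema or contents of the published aggregate-*.json files.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Pair:
    """One baseline/optimized run pair (hypothetical schema)."""
    baseline_cost: float
    optimized_cost: float
    quality_passed: bool  # did the optimized run pass the quality gate?


def aggregate(pairs):
    """Average the per-pair cost delta and count quality passes."""
    deltas = [p.optimized_cost - p.baseline_cost for p in pairs]
    mean_baseline = mean(p.baseline_cost for p in pairs)
    mean_delta = mean(deltas)
    return {
        "repeats": len(pairs),
        "quality": f"{sum(p.quality_passed for p in pairs)}/{len(pairs)}",
        "mean_delta": mean_delta,
        "percent": 100 * mean_delta / mean_baseline,
        # A repeat regresses when the optimized run cost more than its baseline.
        "regressing_repeats": sum(d > 0 for d in deltas),
    }


# Illustrative numbers only; not taken from the benchmark data.
pairs = [
    Pair(0.25, 0.19, True),
    Pair(0.24, 0.20, True),
    Pair(0.26, 0.18, True),
]
summary = aggregate(pairs)
print(summary["quality"], round(summary["percent"], 1))
```

Tracking `regressing_repeats` alongside the mean keeps a single bad repeat visible even when the average still looks favorable.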

cost scale

The useful number is 24.0%, not four cents.

The Agent Analyzer suite went from $0.2468368 to $0.1876295 at published Claude Sonnet 4.6 API rates, a 23.986% reduction. Applied to comparable recurring usage, that reduction is worth about $479.73/month on a $2,000/month spend, $1,199.32/month on $5,000/month, or $2,398.64/month on $10,000/month.
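The arithmetic behind those figures can be checked with a short sketch. It assumes savings scale linearly with spend, which is what "comparable recurring usage" implies; the function names are illustrative.

```python
def reduction_fraction(baseline, optimized):
    """Fraction of the baseline cost removed by the optimized run."""
    return (baseline - optimized) / baseline


def monthly_savings(monthly_spend, fraction):
    """Projected savings, assuming the same reduction applies to all spend."""
    return monthly_spend * fraction


# Suite costs quoted in the text, at published API rates.
BASELINE = 0.2468368
OPTIMIZED = 0.1876295

fraction = reduction_fraction(BASELINE, OPTIMIZED)
print(f"reduction: {fraction:.3%}")
for spend in (2_000, 5_000, 10_000):
    print(f"${spend:,}/month -> ${monthly_savings(spend, fraction):,.2f}/month saved")
```

Run as written, this reproduces the 23.986% reduction and the three monthly figures quoted above.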

token category rule

Output, reasoning, tool-output, and cost are separate claims.

Tool-output and retrieval tools target input/context tokens. Terse-response tools target visible output tokens. Codex exposes reasoning tokens separately. Published API-rate cost reprices only the token categories a harness exposes, while Claude Code native cost can differ because the product may use internal routing and model work.
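A minimal sketch of repricing exposed token categories, to show why a token delta and a cost delta are separate claims. The rates, category names, and token counts below are placeholders chosen for illustration, not the published rates or the benchmark's measurements.

```python
# Hypothetical per-million-token rates in USD. Substitute the published
# rates for the model being priced; these values are placeholders.
RATES_PER_MTOK = {
    "input": 3.00,
    "cache_read": 0.30,
    "output": 15.00,
}


def api_rate_cost(tokens, rates=RATES_PER_MTOK):
    """Reprice exposed token counts; a category with no rate is rejected."""
    missing = set(tokens) - set(rates)
    if missing:
        raise KeyError(f"no rate for: {sorted(missing)}")
    return sum(count * rates[cat] / 1_000_000 for cat, count in tokens.items())


def category_deltas(baseline, optimized):
    """Per-category token change, so each category is reported on its own."""
    cats = sorted(set(baseline) | set(optimized))
    return {c: optimized.get(c, 0) - baseline.get(c, 0) for c in cats}


# Illustrative token counts only.
baseline = {"input": 40_000, "cache_read": 100_000, "output": 2_000}
optimized = {"input": 27_000, "cache_read": 100_000, "output": 1_500}

deltas = category_deltas(baseline, optimized)
delta_cost = api_rate_cost(optimized) - api_rate_cost(baseline)
print(deltas, round(delta_cost, 4))
```

Because output tokens are priced higher than input tokens in this placeholder table, a small output change can outweigh a large input change, which is one reason the matrix lists the categories in separate columns.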

Repeated Recommendation Matrix

Recommendation	Target category	Quality	Estimated tokens (delta)	Tool output (delta)	Output / reasoning (delta)	Published API delta	API-rate percent change	Verdict
Agent Analyzer guidance	Measured workflow and tool-output hygiene	3/3	-12,370	-12,698	Claude output -504	-$0.059207	-24.0%	Positive
claude-context	Input/context retrieval through MCP	3/3	+7,327	+4,170	Claude output +1,169	+$0.058038	+26.0%	Negative here
claude-rlm	Recursive sub-agent decomposition	3/3	+19,477	+6,020	Claude root output -1,197	root-session only; sub-agent cost not exposed	n/a	Negative here
context-mode	Tool-output/input-context batching	3/3	-12,359	-13,257	Claude output +170	-$0.052175	-20.4%	Conditional
grepai	Path-constrained compact retrieval	3/3	-14,567	-15,571	Claude output +443	-$0.037657	-14.5%	Conditional
claude-token-efficient	Visible-output verbosity guidance	3/3	-391	-754	Claude output -79	-$0.004208	-1.8%	Too small
RTK	Explicit shell-output compression	3/3	-12,446	-12,716	Claude output +114	-$0.044316	-18.2%	Conditional
Probe	Bounded code search	3/3	+874	-745	Claude output +548	+$0.038340	+16.6%	Negative here
Semble	Path-limited semantic retrieval	3/3	-16,301	-16,060	Claude output -480	-$0.114194	-41.5%	Positive here
CodeGraph	Pre-indexed code-graph retrieval through MCP	3/3	+6,094	+4,046	Claude output +450	+$0.095826	+54.3%	Research-only
Headroom MCP	Explicit MCP output compression	3/3	-138	-266	Claude output -29	-$0.002205	-1.3%	Not recommended
Headroom proxy	Anthropic-compatible proxy compression	3/3	-1,109	-759	Claude output -233	+$0.084046	+49.7%	Not recommended
Gathon	Pre-indexed knowledge-graph retrieval through MCP	3/3	+9,471	+5,042	Claude output +2,072	+$0.136216	+80.8%	Not recommended
Squeez	Explicit shell-output compression	3/3	-8,471	-8,917	Claude output +73	-$0.028224	-12.1%	Removed: Spec Kitty conflict
ccusage / ccstatusline	Telemetry only	n/a	n/a	n/a	n/a	n/a	n/a	Visibility
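The matrix reports token deltas and API-rate cost deltas in separate columns, and they do not always move together. A minimal sketch, using values copied from the rows above that expose a published API delta, flags tools whose cost rose even though a token signal fell. The function name is illustrative.

```python
# (name, estimated-token delta, tool-output delta, published API delta in USD)
# Values copied from the matrix; claude-rlm and the telemetry row are omitted
# because they have no published API delta.
ROWS = [
    ("Agent Analyzer guidance", -12_370, -12_698, -0.059207),
    ("claude-context", 7_327, 4_170, 0.058038),
    ("context-mode", -12_359, -13_257, -0.052175),
    ("grepai", -14_567, -15_571, -0.037657),
    ("claude-token-efficient", -391, -754, -0.004208),
    ("RTK", -12_446, -12_716, -0.044316),
    ("Probe", 874, -745, 0.038340),
    ("Semble", -16_301, -16_060, -0.114194),
    ("CodeGraph", 6_094, 4_046, 0.095826),
    ("Headroom MCP", -138, -266, -0.002205),
    ("Headroom proxy", -1_109, -759, 0.084046),
    ("Gathon", 9_471, 5_042, 0.136216),
    ("Squeez", -8_471, -8_917, -0.028224),
]


def cost_rose_despite_token_drop(rows):
    """Names of tools where a token signal fell but API-rate cost rose."""
    return [
        name
        for name, est_tokens, tool_output, api_delta in rows
        if api_delta > 0 and (est_tokens < 0 or tool_output < 0)
    ]


print(cost_rose_despite_token_drop(ROWS))  # ['Probe', 'Headroom proxy']
```

Both flagged tools reduced at least one token signal while the repriced session cost went up, so a token reduction alone does not support a positive verdict.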

Candidate Queue

Candidate	Target category	Evidence state	Promotion gate	Current product status
CodeGraph	Pre-indexed code-graph retrieval through MCP	3/3 quality-passing diagnostic; cost/tokens increased	Needs a new fixture or approach with lower full-session cost before promotion	Research-only
Headroom MCP	Explicit MCP output compression	3/3 quality-passing diagnostic; mixed/noisy savings with one repeat regressing	Needs stronger repeated proof with consistent full-session cost reduction before promotion	Research-only / not recommended
Headroom proxy	Anthropic-compatible proxy compression	3/3 quality-passing diagnostic; estimated tokens fell but API-rate cost rose 49.7%	Needs lower full-session cost, not just lower estimated context, before promotion	Research-only / not recommended
Gathon	Pre-indexed knowledge-graph retrieval through MCP	3/3 quality-passing diagnostic; cost/tokens increased	Needs lower full-session cost before promotion	Research-only / not recommended

Candidate rows are not recommendations. They stay out of generated artifacts until local Agent Analyzer proof shows lower full-session cost and a later reviewed change promotes them.
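One way to read the promotion gates above is as a check that must hold in every repeat, not only on average. The sketch below is an assumption about how such a gate could be expressed, not the plugin's actual logic, and the per-repeat deltas are hypothetical.

```python
def ready_for_promotion_review(quality_passes, cost_deltas):
    """True only if every repeat passed quality and every repeat lowered cost.

    Passing this check makes a candidate eligible for a reviewed change;
    it does not promote anything by itself.
    """
    if not cost_deltas or len(quality_passes) != len(cost_deltas):
        return False
    return all(quality_passes) and all(delta < 0 for delta in cost_deltas)


# Hypothetical per-repeat full-session cost deltas in USD.
mixed = ready_for_promotion_review([True, True, True], [-0.01, -0.02, 0.02])
consistent = ready_for_promotion_review([True, True, True], [-0.05, -0.06, -0.07])
print(mixed, consistent)  # False True
```

Requiring a reduction in each repeat matches the "consistent full-session cost reduction" wording in the queue and rejects a candidate whose average is carried by one favorable run.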

Codex And Caveman Controls

Harness	Intervention	Quality	Analyzer estimated tokens (delta)	Tool output (delta)	Native token signal	Published API delta	Verdict
Claude Code `-p`	Agent Analyzer plugin	3/3	-12,370	-12,698	native cost -$0.044219	-$0.059207	Positive
Codex `exec --json`	Agent Analyzer text guidance	3/3	-14,520	-14,527	uncached+output -24,369; reasoning -45	-$0.062392	Positive here
Claude Code `-p`	Caveman terse-output pressure	3/3	+4,355	+4,868	native cost +$0.009919; output -370	+$0.009211	Negative here
Codex `exec --json`	Caveman terse-output pressure	3/3	-9,210	-9,109	uncached+output -4,739; reasoning -2	-$0.033986	Harness-specific

Product Conclusions

Keep Agent Analyzer as the core: it produced repeated reductions in estimated tokens, tool output, visible output, native Claude cost, and published-rate estimated cost.
Narrow the paid pack: default guidance now includes Agent Analyzer workflow, output-budgeted commands, retrieval hygiene, session hygiene, and retry breaking.
Recommend only working reducers by category: Semble, context-mode, RTK, and grepai remain conditional recommendations tied to matching findings.
Do not recommend telemetry as a reducer: ccusage and ccstatusline help users see spend and context pressure, but they do not directly reduce input, output, tool-output, or reasoning tokens.
Remove negative, conflicting, or too-small tools from default advice: claude-context, Probe, Caveman for Claude, claude-rlm, claude-token-efficient, and Squeez no longer ship as default token-saving recommendations.
Do not use RLM as proof for this fixture: claude-rlm is designed for truly long contexts, and on this medium-context task its sub-agent decomposition added analyzer-estimated tokens, tool output, and failed commands.
Keep negative and mixed evidence visible: users can audit what failed, but the download path now points only at what worked.

Artifacts

run locally

Generate your own report and plugin.

npx --yes agent-analyzer@latest run

Runs on your machine first, asks before upload, and turns the sanitized report into a targeted plugin for your detected waste.
