benchmark-backed plugin proof

We measured the advice before claiming savings.

The plugin is crafted from actual Claude Code logs, then tested against fresh runs from the same commit and prompt. The result is not a generic “install more tools” bundle: it keeps the practices that reduced the token category they target and downgrades recommendations that did not prove out.

current finding

Agent Analyzer guidance reduced measured token waste and API-rate cost in repeated controlled runs.

On the larger noisy repo, three fresh Agent Analyzer guided runs averaged 12,370 fewer estimated tokens, 12,698 fewer tool-output tokens, 504 fewer Claude output tokens, and $0.044219 lower native Claude Code cost while preserving the quality gate. Repriced at published Claude Sonnet 4.6 API rates, the repeated mean delta was -$0.059207, or 23.986% lower cost.

That percentage is the honest unit for scaling up. One task saved cents; a team spending $5,000/month on comparable Claude Sonnet coding work would save about $1,199/month (23.986% of $5,000). The basis is simple: baseline $0.2468368, optimized $0.1876295, delta -$0.0592073.
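The scale-up arithmetic can be checked directly. The following is a minimal Python sketch using the figures quoted above; the function name `scaled_savings` is illustrative, and the calculation assumes the measured savings rate carries over to comparable work.

```python
def scaled_savings(baseline_cost, optimized_cost, monthly_spend):
    """Return (percent_saved, monthly_savings) for a given monthly spend.

    Assumption: the measured savings rate holds across comparable work.
    """
    delta = baseline_cost - optimized_cost
    percent_saved = delta / baseline_cost * 100
    monthly_savings = monthly_spend * percent_saved / 100
    return percent_saved, monthly_savings


# Baseline and optimized costs are the repriced means quoted above.
percent, monthly = scaled_savings(0.2468368, 0.1876295, 5000)
print(f"{percent:.3f}% lower cost, about ${monthly:,.0f}/month")
```

Swapping in a different monthly spend shows how the same percentage scales; the result is only as good as the assumption that the workload resembles the benchmarked task.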

Every named tool recommendation now has a three-repeat aggregate. The plugin no longer says “install everything”; it ships the core Agent Analyzer workflow, conditionally recommends only tools that reduced cost in the repeated suite, and removes telemetry-only or negative tools from the reducer path.

-12,370 mean estimated-token delta across three noisy-repo Agent Analyzer guided trials
-12,698 mean tool-output token delta in the repeated Agent Analyzer guided trial
0->0 quality gate status for every published comparison on this page
-24,369 mean Codex uncached-plus-output token delta with equivalent guidance
$1,199/mo scaled Agent Analyzer savings on $5,000/month of comparable Claude Sonnet API-equivalent coding usage

What The Plugin Actually Does

1. Measures the baseline log locally

Agent Analyzer identifies avoidable waste such as large shell output, broad discovery, repeated reads, retry loops, and context growth spikes without uploading raw transcripts.

2. Generates scoped Claude guidance

The plugin turns the measured findings into commands, skills, and a reviewer agent that steer the next session toward narrower reads, quieter verification, and lower-output workflows.

3. Keeps only evidence-backed claims

Tools that performed well are recommended narrowly by token category. Tool-output reducers and retrieval tools are conditional, telemetry-only tools are kept out of the reducer pack, and tools that did not reduce full-session cost are removed from default advice rather than hidden.
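To make the first step concrete, here is a rough Python sketch of how two of the waste categories named above (large tool output and repeated reads) might be flagged. The `ToolEvent` schema, the thresholds, and the sample events are all hypothetical; this is not the analyzer's actual log format or implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToolEvent:
    """One tool call from a session log (hypothetical schema)."""
    tool: str           # e.g. "shell" or "read"
    target: str         # command line or file path
    output_tokens: int  # estimated tokens returned to the model


def find_waste(events, large_output_tokens=2000, max_reads=2):
    """Flag large tool outputs and files read more than max_reads times.

    Both thresholds are illustrative assumptions, not measured values.
    """
    findings = []
    reads = Counter()
    for event in events:
        if event.output_tokens >= large_output_tokens:
            findings.append(("large_output", event.target, event.output_tokens))
        if event.tool == "read":
            reads[event.target] += 1
    for path, count in reads.items():
        if count > max_reads:
            findings.append(("repeated_read", path, count))
    return findings


# Made-up events for illustration only.
events = [
    ToolEvent("shell", "go test ./...", 9500),
    ToolEvent("read", "main.go", 400),
    ToolEvent("read", "main.go", 400),
    ToolEvent("read", "main.go", 400),
]
for finding in find_waste(events):
    print(finding)
```

Everything here runs on local data, which mirrors the claim that raw transcripts are not uploaded. Retry loops and context growth spikes would need ordering and timing information that this simplified schema leaves out.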

Recommendation Verdicts

| Recommendation | Benchmark verdict | How the plugin uses that evidence |
| --- | --- | --- |
| Agent Analyzer guidance | Positive | Keep as the core plugin behavior and make the workflow more direct. |
| Quiet package-scoped testing | Built in | Keep as part of Agent Analyzer guidance because the repeated plugin run used focused reads and quiet verification to reduce tool-output tokens. |
| ccusage | Telemetry | Keep out of the default reducer pack. Use only as independent accounting if the user asks for visibility. |
| claude-context | Removed | Do not recommend for this workflow. Indexing and MCP-search overhead did not amortize in three fresh runs. |
| context-mode | Conditional | Recommend only for tool-output/context bloat. Repeated runs reduced cost 20.4%, but visible output rose on average. |
| grepai | Conditional | Recommend only as input/context retrieval with compact output, small limits, and path filters; repeated cost savings were 14.5%. |
| ccstatusline | Telemetry | Keep out of the default reducer pack. It can improve awareness but does not directly reduce input, output, tool-output, or reasoning tokens. |
| claude-token-efficient | Too small | Do not ship as a default reducer. The repeated savings were 1.8%, useful only as manual verbosity hygiene. |
| Caveman | Removed | Keep out of default Claude plugin guidance. It reduced Codex native tokens in this fixture but made Claude Code estimated/tool-output tokens and cost worse. |
| RTK | Conditional | Recommend explicit commands such as rtk go test ./... before any global shell hooks; repeated cost savings were 18.2%. |
| Probe, Semble, Squeez | Split result | Probe was removed. Semble is a positive repeated retrieval recommendation at 41.5% cost savings. Squeez had a positive old shell-output result, but is removed because it conflicts with Spec Kitty workflows. |

methodology

How the benchmark isolates plugin impact

Same prompt, same commit, separate worktrees, local analyzer reports, and a hard quality gate.

results

What passed, what failed, and what changed

Representative trial table, recommendation verdicts, and sanitized artifacts.

landscape

How public benchmarks compare to our proof

Replay tests, live A/B studies, MCP overhead tests, and why each result means something different.
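The methodology summary above (same commit, separate worktrees, hard quality gate) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the directory naming, the `gate_passed` and `cost` field names, and the placeholder commit id are hypothetical, and the benchmark's real harness may differ. The only external command referenced is the standard `git worktree add --detach <path> <commit>`.

```python
def worktree_commands(commit, arms=("baseline", "guided")):
    """Build one `git worktree add` command per benchmark arm.

    Each arm gets a detached checkout of the same commit in its own
    directory. The "../bench-<arm>" naming is a hypothetical choice.
    """
    return [
        ["git", "worktree", "add", "--detach", f"../bench-{arm}", commit]
        for arm in arms
    ]


def cost_delta(baseline, guided):
    """Return guided minus baseline cost, or None if either run failed the gate."""
    if not (baseline["gate_passed"] and guided["gate_passed"]):
        return None  # hard quality gate: failed runs never count as savings
    return guided["cost"] - baseline["cost"]


# "abc1234" is a placeholder commit id, not a real one.
for command in worktree_commands("abc1234"):
    print(" ".join(command))
```

Returning `None` instead of a number when the gate fails keeps a cheaper but broken run from being reported as a saving.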

Artifacts

run locally

Measure your own agent logs.

npx --yes agent-analyzer@latest run

Analyzes recent sessions locally, asks before upload, and uses the sanitized report to build a custom plugin for the waste it detects.