methodology
Same task. Same commit. Fresh sessions.
The benchmark is designed to falsify our claims. If a plugin-assisted run does not preserve task quality and reduce measured waste, it is reported as a null or negative result.
Benchmark Protocol
- Pick a bounded coding task with a deterministic quality gate.
- Reset two isolated git worktrees to the same fixed commit.
- Run the baseline task in Claude Code
-pmode using the exact task prompt. - Analyze the baseline Claude Code JSONL log locally with Claude Analyzer.
- Generate the Agent Analyzer plugin from that sanitized baseline report.
- Run the optimized task from a fresh worktree and fresh Claude Code session.
- Analyze the optimized log locally, then compare quality, estimated tokens, tool-output tokens, failed commands, rereads, and context-growth events.
Replication Standard
One baseline/optimized pair is only a first-pass A/B result. Because LLM runs vary, recommendation verdicts should be promoted to repeated evidence only after at least three fresh baseline/optimized pairs pass the same quality gate from the same fixed commit and prompt.
The permanent runner is scripts/benchmark-repeat.sh. It invokes the Claude or Codex harness three times by default, writes separate run-01, run-02, and run-03 directories, then emits an aggregate.json with mean, median, min, max, and standard deviation for every numeric delta.
The current proof suite has 3x aggregate artifacts for every named tool recommendation on the results page. MCP-backed claude-context repeats use per-run collection names so the semantic index does not leak across repeats.
The sanitized primary recordings are committed to the repository under docs/benchmarks/primary-data/. They include aggregate files, per-run comparison JSON, task prompts, exit statuses, and quality-gate output. Raw Claude/Codex logs and copied worktrees remain excluded by the privacy boundary.
Quality Gate
The benchmark does not count token savings unless both runs complete the same task and pass the same verification command. The Go benchmark tasks used go test ./... as the external quality gate.
What We Measure
We do not treat every token count as the same kind of evidence. A tool can reduce visible output while increasing total session cost, and a telemetry tool can improve accountability without directly reducing any tokens.
| Metric | Why it matters |
|---|---|
| Estimated tokens | Analyzer estimate of session context consumed by messages and tool output. |
| Input/context tokens | Prompt text, project instructions, file reads, tool results, MCP schemas, and prior conversation context that are sent back to the model. |
| Tool-output tokens | The subset of input/context tokens produced by shell, MCP, file, and search outputs. Large shell output and broad discovery are often the most actionable waste source. |
| Output tokens | Visible assistant text and tool-call JSON emitted by the model. Terse-response tools target this category, but output-token reductions alone do not prove lower total cost. |
| Reasoning tokens | Hidden model reasoning budget reported by some harnesses. These usually change with task complexity, retries, and planning/tool loops; our tool recommendations do not claim direct reasoning-token savings unless measured. |
| Cached input tokens | Reused context that may be billed or quota-weighted differently. Cache behavior can make dollar cost diverge from live context quality. |
| Avoidable waste range | Heuristic range for context that likely could have been avoided without lowering task quality. |
| Failed commands | Retries and avoidable command failures are a direct cost and a signal of poor workflow guidance. |
| Rereads and context-growth events | Signals that Claude is rediscovering context instead of using focused navigation. |
| Actual Claude cost | Useful but not sufficient. Cache behavior can lower dollars even when the live context workflow gets worse. |
| Published API cost estimate | Reprices exposed token categories with current public API rates. This is separate from Claude Code or Codex native billing because those products can use credits, subscriptions, routing, and internal model work. |
| Scaled savings | Converts a repeated benchmark percent into weekly or monthly spend examples. For Agent Analyzer: ($0.2468368 - $0.1876295) / $0.2468368 = 23.986%, so comparable monthly savings are baseline spend * 0.23986. |
Privacy Boundary
Raw Claude Code logs stay local. The public proof pages include only sanitized Agent Analyzer reports, aggregate metrics, benchmark commit identifiers, tool versions, and the interpretation needed to understand the result.
We do not publish raw prompts, private paths, secrets, unknown private tool names, or raw transcript content.
Why This Shapes The Plugin
The plugin is intentionally conservative because the benchmark showed that passive advice and extra tools can add context. The generated pack therefore favors concrete low-output workflows, explicit command recipes, and narrow recommendations over broad “install everything” guidance.
Negative and telemetry-only results changed the product: ccusage and ccstatusline stay measurement-only, while claude-context, Probe, Caveman for Claude, claude-rlm, and claude-token-efficient are no longer default reducer recommendations.
Cross-Harness Extension
The same fixed task now also runs through Codex exec --json. Because the generated artifact is a Claude Code plugin, the Codex optimized run uses equivalent text guidance instead of plugin loading. Codex comparisons report Analyzer event-stream metrics plus Codex's own usage tokens; they do not include Claude Code dollar-cost fields.
How We Compare External Benchmarks
External token-savings studies use different protocols, so we compare methodology before comparing percentages. Fixture replay, live A/B calls, MCP overhead tests, and CLAUDE.md profile tests answer different questions.
run locally
Apply the same local-first flow.
npx --yes agent-analyzer@latest run
Analyzes supported logs locally, asks before upload, and builds a custom plugin from the waste patterns in the sanitized report.