benchmark landscape
Other benchmarks agree on one thing: method matters.
Public token-saving claims use very different protocols. Some replay transcripts through compressors, some run live A/B calls, some test CLAUDE.md profiles, and some measure tool or MCP overhead. We compare those methods before borrowing their conclusions.
research finding
There is no single universal “token reduction” benchmark.
The strongest external evidence supports specific mechanisms: prompt caching for cached input, smaller tool schemas and quieter shell output for input/context, compact profile guidance for input and visible output, and search/navigation that prevents duplicate reads. The weakest evidence comes from installing broad retrieval or MCP tooling without proving that Claude actually uses it to avoid loading extra context. Most public studies do not separately report reasoning tokens.
External Benchmark Comparison
| Source | Methodology | Reported result | Token category | What it proves | How we use it |
|---|---|---|---|---|---|
| Anthropic Claude Code cost guidance | Official guidance, not an A/B benchmark. Recommends usage tracking, clearing stale context, compaction, fewer MCP servers, CLI tools where possible, and hooks that filter noisy test output. | No single headline percentage; it defines the vendor-endorsed cost-control mechanisms. | Input/context, tool-output, telemetry | Context size, MCP/tool overhead, and noisy tool output are first-order cost drivers. | We treat these as mechanism candidates, then require local task benchmarks before product claims. |
| Anthropic token-saving updates | Platform-level API benchmarks for prompt caching and token-efficient tool use. | Prompt caching can reduce cost up to 90% and latency up to 85%; token-efficient tool use reduces output tokens up to 70%, with 14% average reduction among early users. | Cached input, output | Repeated prompt prefixes and tool-use output can be reduced substantially at the platform layer. | We separate platform savings from Claude Code plugin workflow savings. |
| The Distillery | Replay of eight realistic multi-turn Claude Code fixture sessions through an optimization pipeline, using chars/4 token estimation. | 20% reproducible reduction on 124,580 raw tokens; 30-60% in heavier real sessions depending on pattern. | Transcript input/context estimate | Transcript compression, deduplication, and output filtering can reduce fixed session payloads. | Useful benchmark shape for future replay tests, but not a substitute for fresh task A/B runs. |
| Tamp v0.8.0 whitepaper | 216 live A/B calls: 12 scenarios across 18 configurations routed through OpenRouter, judged by Claude Sonnet Haiku 4.5. | L5 balanced default reports 47.56% token savings with 216/216 quality retention. | Task total, category split unclear | Compression can preserve judged task quality across a controlled scenario grid. | We borrow the A/B + quality-retention framing while using repository tests for our own quality gate. |
| TechLoom CLAUDE.md benchmark | 1,188 total runs across three models, 12 coding tasks, and 10 instruction profiles. Scores tests, lint, complexity, and LLM judgment. | CLAUDE.md compression saved only 5-13% actual API tokens; compressed profiles hurt Haiku/Sonnet in several cases; an empty profile won overall on generic tasks. | Input/context instructions and task total | Instruction compression can save input tokens but still damage quality or create unnecessary overhead. | Supports our conservative stance on claude-token-efficient: small diffs, no blanket overwrites. |
| Boarder copy/paste MCP benchmark | Five baseline and five MCP runs in isolated containers on a 700-line Express monolith split into 11 modules. | Baseline averaged 70.9K tokens and $1.22; MCP averaged 81.0K tokens and $1.34, roughly 10% higher cost. | Task total and MCP input/context overhead | Adding a tool or MCP abstraction can increase cost when it is not used consistently enough to replace broad context. | Matches our mixed local results: claude-context and Probe added cost here. Semble and RTK earned scoped recommendations after 3x task runs; Squeez is removed because it conflicts with Spec Kitty workflows. |
| Token Savior | Vendor benchmark on 96 real coding tasks with Claude Opus 4.7, emphasizing structural navigation and memory. | Claims active tokens per task fell from 17,221 to 3,395, an 80% reduction, and wall time fell 83%. | Active context/task total | Symbol-level navigation and memory may be large wins when the agent reliably uses them. | Good candidate mechanism for future reproduction, but currently external vendor evidence. |
| ComputingForGeeks tested-tool roundup | Practical tool tests across a small 52-file benchmark plus vendor data comparisons. | Reports code-review-graph savings around 5% on a small repo and notes RTK showed 0% on their bash task but is useful on noisy output. | Tool-output and task total | Small-repo overhead can erase benefits, and output compression depends on workload shape. | Reinforces our conditional verdicts for RTK and search tools. |
| Local-Splitter paper | Open-source shim across MCP and OpenAI-compatible HTTP. Evaluates seven tactics across edit-heavy, explanation-heavy, chat, and RAG-heavy workloads. | Local routing plus prompt compression saves 45-79% cloud tokens on edit/explanation workloads; full tactic set saves 51% on RAG-heavy workloads. | Cloud input/output total by workload | Workload-specific routing and compression can outperform any one universal technique. | Supports adding separate benchmark families for edit, explanation, RAG, and long-session work. |
| StackOne MCP optimization comparison | Compares schema compression, search-first discovery, response filtering, and code-mode execution for MCP token use. | Reports approach-level ranges, including 70-97% schema compression in relevant MCP scenarios. | MCP schema/response input context | MCP savings often come from reducing tool definitions and responses before they enter context. | Guides future MCP benchmark design; it is not direct evidence for our plugin by itself. |
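
The reported figures in the table can be sanity-checked with simple arithmetic. The following is a minimal Python sketch, not any vendor's actual tooling: `estimate_tokens` applies the chars/4 heuristic described in The Distillery row, and `percent_change` recomputes relative differences from the reported numbers. Both function names are hypothetical, and chars/4 is only a rough approximation of a real tokenizer.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough token estimate using the chars/4 heuristic (not a real tokenizer)."""
    return math.ceil(len(text) / 4)


def percent_change(baseline: float, optimized: float) -> float:
    """Relative change from baseline to optimized, in percent (negative = savings)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (optimized - baseline) / baseline * 100


if __name__ == "__main__":
    # Input figures are copied from the table above.
    print(round(percent_change(70.9, 81.0), 1))   # Boarder tokens, in thousands
    print(round(percent_change(1.22, 1.34), 1))   # Boarder cost, in USD
    print(round(percent_change(17221, 3395), 1))  # Token Savior active tokens
```

Applied to the Boarder row, this gives roughly 14% more tokens alongside the roughly 10% higher cost, which is one reason to keep token counts and dollar cost as separate columns when comparing studies.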
Where Our Benchmark Is Stricter
- Fresh task execution: baseline and optimized runs are independent Claude Code -p sessions, not only replayed transcript compression.
- Same starting point: every comparison starts from the same fixed commit, task prompt, and quality command.
- External quality gate: token savings are not counted unless the repository verification command passes on both sides.
- Negative findings stay visible: tools that added searches, duplicate reads, or turns are labeled negative or conditional instead of folded into a blended claim.
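
The quality-gate rule in the list above can be sketched as a small function. This is an illustrative Python sketch under stated assumptions, not the benchmark's actual code: the `RunResult` type, its field names, and `counted_savings` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunResult:
    """One independent run; field names are illustrative, not a real schema."""
    total_tokens: int
    verification_passed: bool  # did the repository verification command pass?


def counted_savings(baseline: RunResult, optimized: RunResult) -> Optional[float]:
    """Return fractional token savings, or None when the quality gate fails.

    Savings are only counted when verification passed on BOTH sides.
    A negative value means the optimized run used more tokens.
    """
    if not (baseline.verification_passed and optimized.verification_passed):
        return None
    if baseline.total_tokens <= 0:
        raise ValueError("baseline must report a positive token count")
    return (baseline.total_tokens - optimized.total_tokens) / baseline.total_tokens
```

Returning `None` instead of zero keeps a failed gate distinguishable from a run that simply saved nothing, and letting the result go negative keeps tools that added tokens visible rather than clamped away.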
What We Still Need To Benchmark
The public literature suggests that one Go task is not enough to make universal claims. The next benchmark families should cover long multi-hour sessions, monorepo navigation, MCP-heavy workflows, noisy UI test logs, and CLAUDE.md/profile changes. Those are separate mechanisms and should get separate proof.
Product Rule
Claude Analyzer should say exactly what kind of evidence backs each recommendation. Local A/B results can support product proof. External benchmarks can identify promising mechanisms. Vendor claims require reproduction before they become default install advice.
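
One way to keep that rule explicit is to attach an evidence label to every recommendation. The Python sketch below is a hypothetical illustration, not Claude Analyzer's implementation: the `Evidence` enum, the `SUPPORTS` mapping, and both helper functions are assumed names that simply mirror the three evidence kinds named above.

```python
from enum import Enum


class Evidence(Enum):
    LOCAL_AB = "local A/B result"
    EXTERNAL_BENCHMARK = "external benchmark"
    VENDOR_CLAIM = "vendor claim"


# What each evidence kind can support, following the rule stated above.
SUPPORTS = {
    Evidence.LOCAL_AB: "product proof",
    Evidence.EXTERNAL_BENCHMARK: "promising mechanism",
    Evidence.VENDOR_CLAIM: "needs reproduction",
}


def needs_reproduction(evidence: Evidence) -> bool:
    """Vendor claims must be reproduced before becoming default install advice."""
    return evidence is Evidence.VENDOR_CLAIM


def describe(name: str, evidence: Evidence) -> str:
    """Render a recommendation together with the evidence that backs it."""
    return f"{name}: backed by {evidence.value} ({SUPPORTS[evidence]})"


if __name__ == "__main__":
    # "example-tool" is a placeholder name, not a real recommendation.
    print(describe("example-tool", Evidence.EXTERNAL_BENCHMARK))
```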
run locally
Benchmark your own waste patterns.
npx --yes agent-analyzer@latest run
Runs locally, uploads only the sanitized report you approve, and builds a custom plugin for the waste it actually finds.