Tool Inflation and the Quiet Vindication of the Command Line

In November 2025, Anthropic's engineering team published a post with a deliberately understated title: "Code execution with MCP: building more efficient AI agents." The headline claim was a 98.7% reduction in token usage on a representative multi-tool workflow, from roughly 150,000 tokens down to 2,000. Cloudflare had introduced the same architectural idea in September 2025 in "Code Mode: the better way to use MCP," and by February 2026 had released a Cloudflare MCP server exposing 2,500+ API endpoints through just two tools and a code sandbox, a roughly 99.9% reduction in protocol overhead.

Both posts were technically about Model Context Protocol, the open standard Anthropic launched in November 2024 to solve the N-by-M integration problem between AI models and external tools. But read them carefully and they are saying something more uncomfortable. The protocol both companies have invested heavily in works best when most of the agent's work happens somewhere else, in a code execution environment, with MCP reduced to a transport.

This is a quiet vindication of an argument that had been gathering steam since the summer. The most prominent statement came from Armin Ronacher, creator of Flask, whose July 2025 post "Tools: Code Is All You Need" opened with the now-famous line that MCP "suffers from two major flaws: it isn't truly composable, and it demands too much context." By August, Geoffrey Huntley's "too many model context protocol servers" post documented that adding the popular GitHub MCP server alone "defines 93 additional tools and swallows another 55,000 of those valuable tokens." Huntley's piece also reported that Cursor caps users at 40 MCP tools, and that harness builders openly criticized Microsoft's decision to lift Visual Studio Code's 128-tool cap, arguing that it encourages bad practice.

The thesis of this piece is straightforward.

For the agentic workloads executives actually care about most, namely developer productivity, data engineering, internal automation, and operations, a command-line interface is structurally a better target for AI than a tool-calling protocol like MCP. That is not because MCP is broken. It is because LLMs have far more training exposure to shell and CLI patterns than to any tool-calling protocol, because schema definitions compete directly with reasoning for context budget, and because code composes natively while protocol-level tool calls do not.

The more interesting strategic question, the one your CTO should be sketching on a whiteboard this quarter, is not "CLI or MCP." It is how to layer your AI stack so that the model does as much work as possible inside a code execution environment, with MCP, REST, and bash all serving as transport underneath. The architectures that Anthropic and Cloudflare just shipped are early evidence of this convergence.

Here is how we got here, what each interface is actually good at, and what it means for the way you allocate engineering and platform investment over the next eighteen months.

How we got here: from N×M to tool inflation

The pre-MCP world was a wiring problem. Every model provider had a different function-calling spec. Every internal tool had a different REST schema. Every team that wanted an agent to "do something useful with our systems" wrote bespoke glue, and that glue rotted every time either side shipped a release. The Model Context Protocol, introduced by Anthropic in November 2024, was a sensible response. Standardize the tool-discovery and invocation layer, and the integration matrix collapses from N×M to N+M.
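The arithmetic behind that collapse is simple enough to sketch. The counts below are illustrative rather than measured: when every model is wired directly to every tool, the number of integrations is the product of the two; when both sides implement one shared protocol, it is the sum.

```python
def bespoke_integrations(models: int, tools: int) -> int:
    """Every model provider wired directly to every tool: N x M adapters."""
    return models * tools


def protocol_integrations(models: int, tools: int) -> int:
    """Each side implements the shared protocol once: N + M adapters."""
    return models + tools


# Illustrative counts only; neither number comes from a real deployment.
models, tools = 5, 40
print(bespoke_integrations(models, tools))   # 200
print(protocol_integrations(models, tools))  # 45
```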

The adoption curve was sharp. Through 2025, OpenAI, Google, and Microsoft each added MCP support across their agent platforms, though the form and maturity varied by product, and AWS released MCP-related tooling and servers. On December 9, 2025, Anthropic donated the protocol to the newly formed Agentic AI Foundation under the Linux Foundation, co-founded with Block and OpenAI and backed by Google, Microsoft, AWS, Cloudflare, and Bloomberg. By that point there were over 10,000 published MCP servers. For much of 2025, shipping an MCP server became one of the more visible ways a company signaled it was serious about agentic AI.

Then came tool inflation.

Two facts about modern LLMs made this inevitable. First, every tool definition exposed to a model lives in context for the entire conversation. A medium-complexity MCP server is several thousand tokens of schema; a large one runs to tens of thousands. Second, the marginal cost of each additional tool definition is not just tokens but also attention quality. Models can reason worse, not just more expensively, when their context is half-full of unused tool descriptions before the user has typed a word.
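To make the first fact concrete, here is a rough sketch of how schema overhead accumulates. Everything in it is an assumption for illustration: the tool definition is hypothetical, and the four-characters-per-token ratio is a crude rule of thumb, not a real tokenizer. The point is only that the cost scales linearly with the number of definitions and is paid before the user types anything.

```python
import json

CHARS_PER_TOKEN = 4  # crude rule of thumb; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def schema_overhead(tool_definitions: list[dict]) -> int:
    """Estimated tokens spent on tool definitions at the start of a session."""
    return sum(estimate_tokens(json.dumps(tool)) for tool in tool_definitions)


# Hypothetical tool definition, loosely shaped like a JSON-schema tool spec.
example_tool = {
    "name": "list_pull_requests",
    "description": "List pull requests in a repository, optionally filtered by state.",
    "input_schema": {
        "type": "object",
        "properties": {
            "owner": {"type": "string", "description": "Repository owner"},
            "repo": {"type": "string", "description": "Repository name"},
            "state": {"type": "string", "enum": ["open", "closed", "all"]},
        },
        "required": ["owner", "repo"],
    },
}

one = schema_overhead([example_tool])
ninety_three = schema_overhead([example_tool] * 93)  # a 93-tool server
print(one, ninety_three)
```

Real definitions are usually longer than this toy one, which is how a 93-tool server reaches the tens of thousands of tokens Huntley measured.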

The Cloudflare team made this explicit. In their February 2026 launch post, they reported that exposing all 2,500+ Cloudflare API endpoints as traditional MCP tools would consume up to 1.17 million tokens, more than the context window of the most advanced foundation models. The public MCP server repository sets a lower bound at 244,000 tokens even with minimal schemas, against an OpenAPI spec that totals roughly 2 million tokens. With Code Mode's search-plus-execute pattern, the same surface area fits in roughly 1,000 tokens. The token cost of the protocol was, by some measures, larger than the cognitive cost of the task it was meant to support.
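The percentages quoted in this piece follow directly from the token counts. A quick check, using only figures already cited above (the function name is ours, chosen for illustration):

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage reduction from `before` to `after`, rounded to one decimal."""
    return round(100 * (before - after) / before, 1)


# Anthropic's workflow: roughly 150,000 tokens down to 2,000.
print(reduction_pct(150_000, 2_000))    # 98.7

# Cloudflare's API surface: up to 1.17 million tokens down to roughly 1,000.
print(reduction_pct(1_170_000, 1_000))  # 99.9
```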

Meanwhile, a parallel observation was traveling through developer circles. The CLIs everyone already had on disk (gh, kubectl, git, curl, aws, jq) were quietly outperforming their MCP equivalents on agent benchmarks. Ronacher's July post crystallized the case. In August, Mario Zechner built a deliberately fair head-to-head: the same underlying tool exposed as both an MCP server and a CLI, then benchmarked across 120 runs on three terminal-control tasks. His honest conclusion was that, when both tools are well designed, "MCP vs CLI truly is a wash," with both versions hitting 100% success rates. The headline win for CLI, in his test, came from comparing badly designed MCP wrappers to mature standard tools, not from any inherent protocol advantage. By February 2026, Eric Holmes's "MCP is dead. Long live the CLI" reached the Hacker News front page with the bluntest version of the argument: "Ship a good API, then ship a good CLI. The agents will figure it out." The pendulum was swinging.

It is worth pausing on what the pendulum actually swung toward, because "CLI versus MCP" is the wrong framing. The deeper claim is that code is a better interface for an LLM than structured tool calls, and CLIs are simply the most readily available form of code that already wraps every system worth automating. Once you see the argument that way, Anthropic's Code Execution with MCP, Cloudflare's Code Mode, and the broader Skills pattern all become variations on the same theme.

Why CLIs work so well for LLMs: three structural advantages

1. Training data gravity

Imagine you are training a model on roughly the entire public internet plus a sizable fraction of permissively licensed code. How much of that data describes calling gh pr list --json title,number? Quite a lot. How much describes calling a freshly defined MCP tool called github_list_pull_requests with a JSON schema you wrote last week? Effectively none.

The model does not just know that gh and kubectl and git exist. It knows the idioms. It has seen git log --oneline | head -20 thousands of times. It has seen kubectl get pods -n production -o wide | grep CrashLoopBackOff. It has seen find . -name '*.py' | xargs grep -l TODO. These patterns are baked into the weights. The model's prior on what a sensible CLI invocation looks like is enormous.

MCP tool definitions, by contrast, are novel at inference time. Even with high-quality descriptions, the model must reason about an unfamiliar abstraction on the fly, every turn. Ronacher described this directly: "current MCP will always be harder to use than writing code, primarily due to the reliance on inference."

This is not a temporary state of affairs that will be fixed by larger context windows or better schemas. It is a consequence of how training data is distributed. The Unix tools arrive with fifty years of man pages, Stack Overflow answers, GitHub issues, tutorials, and long-form blog posts behind them. Your MCP server arrives with a press release.

2. Composition through code, not inference

This is the deeper advantage and the one that took the industry longest to internalize.

When an agent uses MCP, every step of a workflow round-trips through the model. The agent picks a tool, the runtime executes it, the result returns to the model as text or JSON, the model decides on the next tool, and so on. Each round trip is a separate inference pass with its own latency, cost, and risk of error compounding.

When an agent uses a shell, it can compose. A single bash one-liner can fetch pull requests from one service, filter them with jq, transform the result with awk, and pipe the output into a downstream command. No inference between steps. The intermediate data never enters the model's context. Ronacher's example of writing an entire data-transformation pipeline as a script that the model invokes once, rather than as a sequence of MCP calls the model orchestrates turn by turn, is the canonical case.
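The same shape can be sketched outside the shell. The example below is a minimal illustration, not Ronacher's script: `fetch_pull_requests` is a hypothetical stand-in for a CLI or API call that returns a large payload, and the filter and truncation steps play the roles that jq and head would play in a pipe. Only the final summary is handed back to the model.

```python
import json


def fetch_pull_requests() -> list[dict]:
    """Hypothetical stand-in for a CLI or API call returning a large payload."""
    return [
        {
            "number": n,
            "title": f"Change {n}",
            "state": "open" if n % 3 == 0 else "closed",
            "body": "x" * 2_000,
        }
        for n in range(1, 301)
    ]


def open_pr_summary() -> str:
    """Fetch, filter, and reshape in one pass; only the summary is returned."""
    prs = fetch_pull_requests()                              # stays in the runtime
    open_prs = [pr for pr in prs if pr["state"] == "open"]   # the jq-style filter
    lines = [f'#{pr["number"]} {pr["title"]}' for pr in open_prs[:5]]  # like head -5
    return "\n".join(lines)


raw_size = len(json.dumps(fetch_pull_requests()))
summary = open_pr_summary()
print(f"raw payload: {raw_size} chars, returned to model: {len(summary)} chars")
```

The model pays for one invocation and a few lines of output, regardless of how large the intermediate payload was.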

The Anthropic engineering post on Code Execution with MCP captures the same dynamic from inside the protocol. Their Google Drive plus Salesforce example imports a long meeting transcript and writes it to a CRM record. Done as discrete MCP calls, the transcript flows through the model context twice. Done as code, it never touches the model at all. The result is the 98.7% token reduction.
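A toy model of that example shows where the tokens go. This is a sketch under stated assumptions, not Anthropic's code: `get_document` and `update_record` are hypothetical stubs, the transcript is synthetic, and `ContextMeter` simply counts characters that would pass through the model.

```python
TRANSCRIPT = "word " * 10_000  # hypothetical long meeting transcript


class ContextMeter:
    """Counts characters that pass through the model's context."""

    def __init__(self) -> None:
        self.chars = 0

    def see(self, text: str) -> str:
        self.chars += len(text)
        return text


def get_document(doc_id: str) -> str:
    """Hypothetical document-store stub."""
    return TRANSCRIPT


def update_record(record_id: str, notes: str) -> str:
    """Hypothetical CRM stub; returns a short receipt."""
    return f"updated {record_id} ({len(notes)} chars)"


def via_tool_calls(ctx: ContextMeter) -> None:
    doc = ctx.see(get_document("doc-1"))   # tool result enters the context
    args = ctx.see(doc)                    # model re-emits it as a call argument
    ctx.see(update_record("rec-1", args))  # receipt enters the context


def via_code(ctx: ContextMeter) -> None:
    # Runs inside the execution environment; the model sees only the receipt.
    receipt = update_record("rec-1", get_document("doc-1"))
    ctx.see(receipt)


tools, code = ContextMeter(), ContextMeter()
via_tool_calls(tools)
via_code(code)
print(tools.chars, code.chars)
```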

Cloudflare's version is even starker. Their two-tool MCP server is a search function (find the right endpoint in the OpenAPI spec) and an execute function (run JavaScript that calls the API). Composition happens in the V8 isolate, not in the model's reasoning. The agent writes a five-line script that orchestrates a multi-step workflow against any of 2,500 endpoints in one inference pass.
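The two-tool surface is easy to sketch in miniature. What follows is an illustration in Python, not Cloudflare's implementation, which runs JavaScript in a V8 isolate: the catalog entries, `call_api`, and both tool functions are hypothetical, and `exec` is used only to show the shape of the pattern.

```python
# Toy catalog standing in for an OpenAPI spec; paths and summaries are invented.
CATALOG = {
    "GET /zones": "List zones",
    "GET /zones/{id}/dns_records": "List DNS records for a zone",
    "POST /zones/{id}/purge_cache": "Purge cached content for a zone",
}


def call_api(method: str, path: str) -> dict:
    """Hypothetical API client exposed to generated code; returns canned data."""
    return {"method": method, "path": path, "ok": True}


def search(query: str) -> list[str]:
    """Tool 1: find endpoints whose path or summary mentions the query."""
    q = query.lower()
    return [ep for ep, summary in CATALOG.items()
            if q in ep.lower() or q in summary.lower()]


def execute(source: str) -> object:
    """Tool 2: run model-written code with only `call_api` in scope.

    exec() is NOT a sandbox; a real deployment would use an isolate or microVM.
    """
    scope = {"__builtins__": {}, "call_api": call_api}
    exec(source, scope)
    return scope.get("result")


# A short script the model might write: two API calls, one inference pass.
script = """
zones = call_api("GET", "/zones")
purge = call_api("POST", "/zones/123/purge_cache")
result = [zones["ok"], purge["path"]]
"""
print(search("dns"))
print(execute(script))
```

However many endpoints the catalog holds, the model's context carries only the two tool definitions plus whatever `search` returns.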

The empirical story is not entirely one-sided. The team at Ranger ran a careful Playwright CLI versus Playwright MCP benchmark in early 2026 and reported that while the CLI was dramatically more context-efficient, the MCP version was about twice as fast in wall-clock time and slightly cheaper, because the MCP returned richer state per call and required fewer round trips. Context efficiency, they noted, "does not correlate to speed" in their setup. Zechner's earlier benchmark reached a compatible conclusion from the other direction: when both interfaces are well designed, the protocol matters less than how the tool surfaces information. The takeaway is not that composition through code is universally faster. It is that the design of your interface determines what the agent has to reason about, and reasoning is the bottleneck. When the CLI returns sparse output, the agent makes more calls. When the MCP returns rich state, it makes fewer. Both can be tuned.

3. Self-documenting failure modes

The third advantage is unglamorous but operationally significant. When a CLI command fails, it prints a message to stderr that a human or a model can read. When an MCP call fails, the failure can land at any of several layers (transport, serialization, OAuth scope, or the tool server itself), and the quality of that error message varies widely. Ronacher again: MCP failures are "incredibly hard to debug."

This matters more in production than it does in demos. Agents in real workflows hit edge cases constantly. The interface that surfaces failures legibly recovers faster. The Unix shell has roughly fifty years of accumulated convention around exit codes, stderr output, and "did you mean" suggestions. MCP servers vary enormously in how they report errors, and the variance is itself a source of bugs.
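Those conventions are easy to see from the harness side. The sketch below runs a stand-in for a failing CLI and captures the three signals a process gives for free: exit code, stdout, and stderr. The error text and the `--jsno` flag are invented for illustration; the `subprocess` calls are standard library.

```python
import subprocess
import sys


def run(argv: list[str]) -> dict:
    """Run a command and return exit code, stdout, and stderr."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout.strip(),
        "stderr": proc.stderr.strip(),
    }


# A stand-in for a failing CLI: writes a readable message to stderr, exits 2.
failing = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('error: unknown flag --jsno (did you mean --json?)\\n'); sys.exit(2)",
]

outcome = run(failing)
if outcome["exit_code"] != 0:
    # This text is what the agent reads to decide how to retry.
    print(f"failed with {outcome['exit_code']}: {outcome['stderr']}")
```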

Where MCP actually wins

The CLI-first narrative is correct enough that it has hardened into a take, which is when you should be most suspicious of it. There are three regimes in which MCP, or something MCP-shaped, is materially better than handing the agent a shell.

Authentication and multi-tenancy. When an agent acts on behalf of a specific user against a system they do not own, the OAuth 2.1 layer that the MCP spec defines is meaningful. A CLI typically assumes one user's credentials are on the local machine. An MCP server can hold scoped tokens, refresh them, downscope them per request, and revoke them, all without the agent ever seeing the secret. This is exactly the surface where letting an agent shell out becomes a security incident. Cloudflare's enterprise MCP architecture, published in early 2026, leans on this property hard. For any agent that touches customer systems or regulated data, the auth boundary alone is worth the protocol.
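The boundary being described can be sketched without any protocol machinery. This is a minimal illustration of the idea, not the OAuth 2.1 flow the MCP spec defines: `TokenBroker`, its method names, and the scope strings are all hypothetical. What it shows is that the agent holds an opaque, scoped, revocable handle while the real credential stays on the server side.

```python
import secrets
import time


class TokenBroker:
    """Server-side holder of the real credential; agents get opaque handles."""

    def __init__(self, upstream_secret: str) -> None:
        self._secret = upstream_secret  # never returned to the agent
        self._grants: dict[str, tuple[frozenset[str], float]] = {}

    def grant(self, scopes: set[str], ttl_seconds: float) -> str:
        """Issue a short-lived handle limited to the given scopes."""
        handle = secrets.token_urlsafe(16)
        self._grants[handle] = (frozenset(scopes), time.monotonic() + ttl_seconds)
        return handle

    def revoke(self, handle: str) -> None:
        self._grants.pop(handle, None)

    def call(self, handle: str, scope: str, action: str) -> str:
        scopes, expires = self._grants.get(handle, (frozenset(), 0.0))
        if time.monotonic() >= expires or scope not in scopes:
            raise PermissionError(f"{scope} not permitted for this handle")
        # A real server would attach self._secret to the upstream request here.
        return f"{action}: ok"


broker = TokenBroker("s3cr3t")
handle = broker.grant({"crm:read"}, ttl_seconds=60)
print(broker.call(handle, "crm:read", "list contacts"))
```

A shell with credentials on disk offers no equivalent choke point, which is why this regime favors a server-mediated interface.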

Curated, vendor-blessed semantics. Cameron Cooke, author of the MCP-vs-CLI post on Async Let and creator of XcodeBuildMCP, built the server specifically because the relevant Apple CLIs were undocumented, version-dependent, and platform-quirky enough that the model kept getting them wrong. His buildAndRun MCP tool encapsulates dozens of steps an iOS developer would normally take, presented as a single stable contract. For complex domains where the underlying CLI surface is genuinely hostile to inference, a curated MCP server is a better interface than the raw tools, even though it costs context. The bet is that the higher reliability per call more than pays for the extra schema.

Cross-application reach. A CLI assumes a shell. Many agentic products do not have one and should not. When Claude.ai or ChatGPT Desktop or a custom enterprise chat product wants to give a user the ability to read their CRM or schedule a meeting, MCP gives the vendor a single integration that works across many client surfaces. The model does not need a sandbox. The user does not need to install anything. For consumer and enterprise SaaS products where the AI sits inside someone else's app, MCP's protocol-first design is exactly the right shape.

The honest reading of mid-2026 is that MCP is excellent at the things it was designed for, namely interoperable, authenticated, vendor-blessed access to external systems from inside chat-style products, and weak at the things it was being asked to do by maximalists, namely orchestrating long, composable, code-heavy agent workflows.

REST and OpenAPI: the substrate everyone forgets

Step back one level and a useful observation appears. Both gh and the GitHub MCP server are skins over GitHub's REST API. Both the AWS CLI and the AWS MCP servers ultimately call AWS service APIs over HTTPS. Cloudflare's MCP, as their public repo states plainly, wraps their OpenAPI spec.

The CLI-versus-MCP debate is really a debate about what shape of skin to put on REST, and where the inference loop should happen relative to that skin.

This matters strategically. A clean, well-maintained OpenAPI specification is becoming the most valuable artifact a platform can produce in the agent era. From one spec, modern tooling will generate idiomatic CLIs (Stainless, Speakeasy, OpenAPI Generator), typed SDKs in a dozen languages, MCP servers, and Code Mode handlers. The spec is the source of truth and every interface downstream of it is a compile target.
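A toy version of "the spec is the source of truth" looks like this. The spec fragment is invented and the generators are deliberately naive; real tools such as those named above do far more. The sketch only shows that CLI command names and tool definitions can both be derived mechanically from the same `paths` object.

```python
# Toy fragment in the shape of an OpenAPI `paths` object; contents are invented.
SPEC = {
    "paths": {
        "/invoices": {
            "get": {"operationId": "listInvoices", "summary": "List invoices"},
            "post": {"operationId": "createInvoice", "summary": "Create an invoice"},
        },
        "/invoices/{id}": {
            "get": {"operationId": "getInvoice", "summary": "Fetch one invoice"},
        },
    }
}


def operations(spec: dict):
    """Yield (METHOD, path, operation) for every operation in the spec."""
    for path, methods in spec["paths"].items():
        for method, op in methods.items():
            yield method.upper(), path, op


def to_kebab(name: str) -> str:
    """Convert camelCase to kebab-case, e.g. listInvoices -> list-invoices."""
    return "".join("-" + c.lower() if c.isupper() else c for c in name).lstrip("-")


def cli_commands(spec: dict) -> list[str]:
    """Compile target 1: CLI subcommand names."""
    return [to_kebab(op["operationId"]) for _, _, op in operations(spec)]


def tool_definitions(spec: dict) -> list[dict]:
    """Compile target 2: minimal tool definitions for a tool-calling protocol."""
    return [
        {"name": op["operationId"], "description": op["summary"],
         "request": f"{method} {path}"}
        for method, path, op in operations(spec)
    ]


print(cli_commands(SPEC))
print(tool_definitions(SPEC)[0])
```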

Vendors with messy or undocumented APIs face a compounding tax. Their CLIs will be inconsistent enough that agents will mis-call them. Their MCP servers, if they ship them, will be hand-crafted, hard to maintain, and expensive in context. Their developer experience will degrade not just for humans but for the millions of agent instances that increasingly act on humans' behalf.

The reverse is also true. Companies that have invested in API hygiene, Stripe being the canonical example, get a quiet flywheel. Their REST surface is legible, their SDKs are auto-generated, their CLI is consistent, and an agent that has any of these can flip between them as the situation calls for. The Anthropic post on advanced tool use mentions that the Tool Search Tool with deferred loading improved Opus 4 accuracy on MCP evaluations from 49% to 74%, and Opus 4.5 from 79.5% to 88.1%. The mechanism, lazy schema loading, is essentially the same trick the OpenAPI plus Code Mode pattern uses. The underlying truth is that agents reward legibility, not protocol choice.

The synthesis: code execution as the real interface

If you squint at the three most-discussed architectural patterns of late 2025 and early 2026, they collapse into the same shape.

Anthropic's Code Execution with MCP turns each MCP server into a directory of TypeScript files on disk. The agent navigates the filesystem to discover tools, imports only the ones it needs, and writes code that calls them. Tool definitions cost zero tokens until the agent decides to read them. Intermediate data lives in the execution environment, not the model context.
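The discovery step can be sketched with nothing but the filesystem. The layout below is hypothetical and uses Python files rather than the TypeScript files Anthropic describes; the function names are ours. Listing names is cheap, and a definition is read only when the agent asks for it.

```python
import tempfile
from pathlib import Path


def build_demo_tree(root: Path) -> None:
    """Write a hypothetical servers/<server>/<tool>.py layout to disk."""
    files = {
        "servers/drive/get_document.py": '"""Fetch a document by id."""\n',
        "servers/drive/list_files.py": '"""List files in a folder."""\n',
        "servers/crm/update_record.py": '"""Update a CRM record."""\n',
    }
    for rel, body in files.items():
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(body)


def list_tools(root: Path) -> list[str]:
    """Cheap discovery: names only, no definitions read."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.py"))


def read_tool(root: Path, rel: str) -> str:
    """Pay for a definition only when the agent decides it needs it."""
    return (root / rel).read_text()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_demo_tree(root)
    names = list_tools(root)
    definition = read_tool(root, "servers/crm/update_record.py")

print(names)
print(definition)
```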

Cloudflare's Code Mode exposes a single execute tool that runs the agent's JavaScript inside a V8 isolate. The isolate has a typed SDK against the OpenAPI spec. The agent writes a short script; the script orchestrates calls; only the final result flows back through the model.

Anthropic's Skills, which Simon Willison called "maybe a bigger deal than MCP" at launch, sit one layer higher. A Skill is a Markdown recipe that lives idle in the model's awareness, costs about thirty tokens at rest, and expands into a focused playbook when the model recognizes a matching task. The Skill tells the agent which CLI or MCP tool to invoke and how, without dumping every tool's schema into context up front.

What all three share is a separation of two things that early MCP conflated: the catalog of what the agent can do, and the act of doing it. The catalog should live on disk or in a search index, not in context. The act should happen inside a code execution environment, not as a sequence of round-trip tool calls.

In practice, the mature 2026 agent architecture looks layered:

Skills or rules files describe what the agent should do for recurring tasks. Tiny footprint, loaded on demand.
A code execution sandbox is the agent's actual hands. The agent writes code, the code runs, the result returns.
CLIs and language-native SDKs are the most natural libraries for the code to call, because the model knows them best.
MCP servers are the right transport when authentication, multi-tenancy, or vendor curation matters more than composability.
REST plus OpenAPI sits beneath all of it as the canonical contract between agent and service.

This is not zero-sum. It is layering. And it is exactly the architecture both Anthropic and Cloudflare are now publishing as their reference design. The protocol war narrative, where MCP must defeat OpenAPI or be defeated by it, misreads what is happening. The protocols are converging into a stack.

[Figure: Layered agent architecture. Top: Skills / Rules. Middle: code execution sandbox. Bottom row: CLIs, SDKs, MCP servers, raw REST, all sitting above the services they wrap. Caption: One agent. One code runtime. Multiple interfaces. Any system. Code is the universal adapter.]

What this means for executives

Five implications follow from the convergence.

Treat OpenAPI as the most valuable artifact your platform produces

If you own a platform, internal or external, the single highest-leverage engineering investment for the agent era is a complete, well-versioned, accurately documented OpenAPI specification. From that spec, your CLI, SDKs, MCP server, and Code Mode handler are all derivable. Without it, every downstream surface is hand-crafted, divergent, and expensive to maintain. The companies whose APIs become the substrate for thousands of agentic workflows over the next two years will mostly be the ones that already had API hygiene; the others will spend the next two years catching up.

Concretely: audit your spec. If it lags your actual API, fix that first. If your CLI is hand-written rather than generated, ask whether that is a real differentiator or accumulated debt.

Budget context like a finite resource on the P&L

The most common operational mistake observed in enterprise agent deployments through 2025 was loading five or six MCP servers into a default harness without ever measuring what fraction of the context window that consumed. A harness that boots with 60,000 tokens of protocol overhead has spent nearly half of a 128,000-token window before any reasoning begins.

There is a numeracy gap to close inside most platform teams. Token usage per call, per task, and per harness configuration should be observable, attributable, and reported the way latency and error rate already are. The infrastructure for this exists (OpenTelemetry-style tracing for LLM calls is maturing fast), but the discipline of looking at it is not yet routine. It needs to be.
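The report itself does not need to be sophisticated. Here is a minimal sketch of a per-harness budget line, assuming you already have a token count for each server's tool definitions. The server names and per-server figures are hypothetical, chosen so that they add up to the 60,000-token example on a 128,000-token window.

```python
def budget_report(window: int, overhead_by_server: dict[str, int]) -> dict:
    """Summarize how much of the context window tool definitions consume."""
    used = sum(overhead_by_server.values())
    return {
        "overhead_tokens": used,
        "remaining_tokens": window - used,
        "overhead_pct": round(100 * used / window, 1),
        "largest": max(overhead_by_server, key=overhead_by_server.get),
    }


# Hypothetical per-server schema costs for a five-server default harness.
servers = {
    "tracker": 25_000,
    "chat": 12_000,
    "warehouse": 9_000,
    "ci": 8_000,
    "docs": 6_000,
}
report = budget_report(128_000, servers)
print(report)
```

A line like this, emitted per harness configuration and tracked over time, is the token equivalent of a latency dashboard.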

Code execution is the security surface to harden

The instinctive concern about CLI-first agents is that handing the model a shell with broad credentials is a wider attack surface than handing it a narrow MCP server. That concern is correct as stated and almost beside the point. The actual answer the industry is settling on is that the agent gets a sandbox, not a host shell. Cloudflare's V8 isolates, Anthropic's filesystem-isolated execution environments, container-based sandboxes like Apple's container framework or Docker with strict resource limits, Firecracker microVMs, gVisor: this is the new perimeter.

The pattern is consistent across the leading harnesses. Strip the agent's environment to a minimum viable shell, give it ephemeral credentials scoped to the task, log every tool call and every shell command, and trust nothing produced inside the sandbox to leave it without inspection. For executives, the practical takeaway is that the security review for an agent harness is not "do we allow shell access" but "what does the sandbox look like and who maintains it."
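The harness-side half of that pattern can be sketched in a few lines. This is an illustration under stated assumptions, not a sandbox: `run_scoped`, `TASK_TOKEN`, and the audit record shape are hypothetical, a POSIX host is assumed, and stripping the environment only limits what the child process inherits. Real isolation comes from the isolate, container, or microVM underneath.

```python
import subprocess
import sys
import time

AUDIT_LOG: list[dict] = []


def run_scoped(argv: list[str], task_token: str, timeout_s: float = 5.0) -> str:
    """Run one command with a stripped environment and an ephemeral credential.

    This limits what the child inherits; it is NOT isolation by itself.
    """
    env = {"TASK_TOKEN": task_token}  # nothing inherited from the host
    started = time.time()
    proc = subprocess.run(argv, env=env, capture_output=True, text=True,
                          timeout=timeout_s)
    # Log every command, whether it succeeded or not.
    AUDIT_LOG.append({"argv": argv, "exit_code": proc.returncode,
                      "started": started})
    return proc.stdout


# Probe which of three sensitive-looking variables the child can see.
probe = [
    sys.executable, "-c",
    "import os; print(sorted(k for k in os.environ "
    "if k in ('TASK_TOKEN', 'HOME', 'AWS_SECRET_ACCESS_KEY')))",
]
print(run_scoped(probe, task_token="ephemeral-123"))
```

Of the three variables probed, the child should see only the task-scoped token it was explicitly given, and the audit log records the invocation.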

A new infrastructure role is forming

The job description does not yet have a stable name. Some call it agent harness engineer, some AI infrastructure engineer, some platform engineer for AI. The skill profile is distinct from ML research, MLOps, or traditional platform engineering, though it overlaps with all three. The work is: design tool catalogs, profile context usage, write the bridge layers between Skills and execution environments, maintain the sandbox, and tune the harness for cost and reliability.

The companies that staff this role aggressively in 2026 will look back in 2027 the way teams that hired SREs in 2010 looked back in 2013. The talent is scarce and the leverage is high. If you are running an AI-heavy product, expect to allocate one such hire per significant agent surface.

The protocol war is mostly noise

It is tempting to read the CLI-versus-MCP debate as another protocol religion fight, like REST versus GraphQL or gRPC versus REST a decade ago. The deeper story is that the protocols are layering, not competing. The architectural decision that matters is where you put the inference loop relative to the code execution environment. Get that right and the protocol underneath becomes substitutable. Get it wrong and no protocol will save you.

For most enterprise buyers, the practical posture is to be protocol-agnostic at the boundary and code-execution-centric at the core. Use MCP where vendors give it to you and it fits the auth and curation story. Use CLIs and SDKs where the model already knows them. Use OpenAPI as the spine. Spend your differentiation budget on the harness, the sandbox, and the Skills.

Outlook: what survives every UI revolution

There is a temptation in technology writing to declare winners. The honest read is that the shell is not "winning." The shell is doing what it has done for fifty years, which is to outlive whatever was supposed to replace it. The web did not kill the CLI. IDEs did not kill the CLI. Cloud consoles did not kill the CLI. The agent era will not either, because the agent era turns out to need exactly the property the shell has always had: text-streaming, programmable interfaces that compose without the system needing to think between steps.

What will change is the audience. Through 2024, CLIs were a tool for developers. Through 2026, they are increasingly a tool for AI agents acting on behalf of developers, operators, analysts, and eventually anyone whose work involves a system that has a command-line interface. The shells your engineers love because they are productive are about to become the runtime your AI workloads depend on because they are legible.

The question to leave your team with this week is not "should we adopt MCP." MCP is already broadly adopted, and where it fits it works. The question is the one Anthropic and Cloudflare both just answered in their own architecture: when our agent needs to do real work, are we letting it write code, or are we making it talk to our systems one tool call at a time?

If you cannot answer that quickly, your context budget probably already shows it.