Artificial Analysis shows coding agents are more than model scores

Artificial Analysis has put a sharper number on a truth developers already feel: the best coding agent is no longer just the best model.

On May 11, Artificial Analysis introduced its Coding Agent Benchmarks, and the timing matters. The market has spent much of the past year arguing over which frontier model writes better code. This index points to a more practical question for startups and engineering teams: which full agent setup actually gets useful work done at a sane cost, in a reasonable amount of time, inside the workflow developers already use?

That is a very different race. A model can be strong at code generation and still stumble when it has to inspect a repository, run commands, recover from failed tests, or decide when to stop. Anyone who has used these tools for real work knows the gap. The agent is not just the brain. It is the harness, the terminal loop, the file editing strategy, the context management, the caching behavior and the product surface wrapped around the model.

According to Artificial Analysis, the new Coding Agent Index combines three benchmarks: SWE-Bench-Pro-Hard-AA with 150 code generation questions, Terminal-Bench v2 with 84 agentic terminal-use questions, and SWE-Atlas-QnA with 124 technical Q&A questions. The headline score is the average pass@1 across three runs of each benchmark, so one lucky or unlucky run moves the number less, and the three-benchmark mix gives the result more texture than a single patch-writing leaderboard.
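The arithmetic behind that headline number can be sketched roughly as follows. This is a minimal illustration, not Artificial Analysis's published method: the benchmark sizes come from the description above, but the equal weighting across benchmarks, the function names and the pass counts are all assumptions made for the example, and none of the numbers are real results.

```python
from statistics import mean

# Benchmark sizes as reported in the article.
BENCHMARK_SIZES = {
    "SWE-Bench-Pro-Hard-AA": 150,
    "Terminal-Bench v2": 84,
    "SWE-Atlas-QnA": 124,
}


def pass_at_1(passed: int, total: int) -> float:
    """Fraction of questions solved on the first attempt in a single run."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return passed / total


def benchmark_score(passes_per_run: list[int], total: int) -> float:
    """Mean pass@1 across repeated runs of one benchmark."""
    return mean(pass_at_1(p, total) for p in passes_per_run)


def composite_index(runs: dict[str, list[int]]) -> float:
    """Unweighted mean of per-benchmark scores.

    ASSUMPTION: equal weights per benchmark. The article does not say
    how the three components are weighted.
    """
    return mean(
        benchmark_score(runs[name], size)
        for name, size in BENCHMARK_SIZES.items()
    )


# Hypothetical pass counts for three runs of each benchmark (not real data).
example_runs = {
    "SWE-Bench-Pro-Hard-AA": [60, 75, 90],
    "Terminal-Bench v2": [40, 42, 44],
    "SWE-Atlas-QnA": [62, 62, 62],
}

if __name__ == "__main__":
    for name, size in BENCHMARK_SIZES.items():
        print(f"{name}: {benchmark_score(example_runs[name], size):.3f}")
    print(f"Composite (equal weights assumed): {composite_index(example_runs):.3f}")
```

Averaging within each benchmark first, then across benchmarks, keeps the 150-question set from dominating the 84-question one. A question-weighted average would be an equally plausible reading, which is one more reason to check the component scores rather than rely on the composite alone.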

The most interesting part is not only the composite score. Artificial Analysis also holds Claude Opus 4.7 constant and compares how it performs through different coding-agent harnesses, including Cursor, Claude Code and OpenCode. That simple choice changes the conversation. If the same model behaves differently depending on the tool around it, then the tool is not a neutral wrapper. It is part of the capability.

This matters for founders because many buying decisions still start with a model name. That made sense when coding assistants were mostly autocomplete boxes and chat panels. It makes less sense when agents are planning tasks, editing multiple files, running tests, using terminals and producing reviewable diffs. In that world, the better product may be the one that wastes fewer turns, keeps better context, avoids unnecessary token burn and fits naturally into a team's review process.

Cost is also moving from a finance detail to a product feature. The Artificial Analysis page breaks out token usage, cached input, output tokens and mean API cost per task. That is not trivia. A startup running agents across a busy engineering team can quickly find that two tools with similar headline quality have very different effective costs because one benefits more from prompt caching or consumes less context to reach the same answer.
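A back-of-the-envelope sketch shows how caching can separate two agents that read the same amount of context. Everything here is hypothetical: the prices are placeholders rather than any vendor's actual rates, the token counts are invented, and the class and function names exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. Values used below are placeholders."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


@dataclass(frozen=True)
class TaskUsage:
    """Token counts for one agent task, split the way the benchmark page reports them."""
    uncached_input_tokens: int
    cached_input_tokens: int
    output_tokens: int


def task_cost(usage: TaskUsage, pricing: Pricing) -> float:
    """API cost in USD for a single task under the given pricing."""
    total = (
        usage.uncached_input_tokens * pricing.input_per_m
        + usage.cached_input_tokens * pricing.cached_input_per_m
        + usage.output_tokens * pricing.output_per_m
    )
    return total / 1_000_000


# ASSUMPTION: illustrative prices only, with cached input billed at a discount.
pricing = Pricing(input_per_m=10.0, cached_input_per_m=1.0, output_per_m=50.0)

# Both hypothetical agents read 1M input tokens and write 20k output tokens.
# They differ only in how much of that input is served from the prompt cache.
cache_friendly = TaskUsage(200_000, 800_000, 20_000)
cache_hostile = TaskUsage(800_000, 200_000, 20_000)

if __name__ == "__main__":
    print(f"Cache-friendly agent: ${task_cost(cache_friendly, pricing):.2f} per task")
    print(f"Cache-hostile agent:  ${task_cost(cache_hostile, pricing):.2f} per task")
```

Under these made-up numbers the second agent costs more than twice as much per task despite identical total token volume, which is the kind of gap a headline quality score would not reveal.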

Latency carries the same lesson. The benchmark tracks agent wall time per task, not only raw model speed. That distinction is important. A fast model inside a clumsy loop can feel slow, while a slower model inside a tighter workflow can finish sooner because it asks fewer questions, runs fewer redundant commands and makes cleaner decisions. Developers do not experience tokens per second. They experience waiting.

Benchmarks are catching up to real work

The benchmark mix also reflects where software engineering work actually lives. SWE-Bench-Pro-Hard-AA stresses implementation. Terminal-Bench v2 tests whether an agent can operate in a command-line environment. SWE-Atlas-QnA checks whether it can understand a codebase deeply enough to answer technical questions before changing anything. Scale AI's SWE-Atlas work, published this month, makes the same broader point: useful engineering agents need comprehension, validation and maintenance skills, not only patch generation.

That is especially relevant for small teams. A large company can absorb some agent waste through process, dedicated tooling and internal evaluation. A startup usually cannot. If an agent breaks tests, misunderstands architecture or burns through an expensive context window, the cost shows up immediately in developer time. The promise of coding agents is not that they replace engineering judgment. It is that they reduce the amount of low-value work around that judgment.

The index is still not a perfect buyer's guide. No public benchmark can fully model a messy private codebase, a product manager changing requirements, a flaky CI pipeline or a senior engineer reviewing a risky migration. Artificial Analysis is clear that users still need to read the component scores alongside the composite. That caveat is not a weakness. It is the right way to read the market.

The practical takeaway is simple. Model choice still matters, but it is becoming a weaker proxy for developer productivity. The next durable advantage in coding agents may sit in harness design, context handling, caching, latency, repo awareness and workflow integration. For founders, that means the question is no longer which model wins the leaderboard. It is which agent helps your team ship more correct code with less drag.
