Anthropic's Claude Opus 4.8 outpaces GPT-5.5 on benchmarks while making AI agents smarter and more honest

Anthropic released Claude Opus 4.8 on May 28, putting its flagship model back at the top of several public benchmarks while making a more practical promise: fewer confident mistakes when agents are working on real tasks.

Anthropic is moving fast enough now that even point releases can feel like market events. Claude Opus 4.8 arrived 41 days after Opus 4.7, with the same standard API price, broader agent tools, and enough benchmark movement to put fresh pressure on OpenAI and Google at the high end of the model market.

According to Artificial Analysis, Opus 4.8 scored 61.4 on its Intelligence Index, ahead of GPT-5.5 at 60.2. Anthropic also says the model scored 69.2 percent on SWE-bench Pro, compared with 64.3 percent for Opus 4.7 and 58.6 percent for GPT-5.5. Those are not abstract bragging rights for buyers. They point to a model that is getting better at messy software work, where the output is only useful if it survives tests, edge cases, and the parts of a codebase nobody remembered to document.

The biggest academic-looking jump is on USAMO 2026, the American mathematics olympiad benchmark used to test advanced reasoning. Opus 4.8 rose from 69.3 percent to 96.7 percent versus the prior generation. That does not prove a clean architectural breakthrough on its own, but it does suggest better handling of formal, multi-step problems where a plausible answer is not enough. For developers and enterprise teams, that matters because the same discipline is needed when an agent is tracing dependencies, refactoring services, or checking whether a plan actually follows from the facts in front of it.

The headline product feature is Dynamic Workflows in Claude Code. The research preview lets Claude plan a large job, write orchestration scripts, and run tens or hundreds of parallel subagents inside a single session before verifying the result. Anthropic gives codebase-scale migrations across hundreds of thousands of lines as the clearest example. That is the right use case to watch, because large migrations are where today’s agents often look impressive for the first hour and then fall apart when coordination becomes more important than raw code generation.

Opus 4.8 keeps the one million token context window on the Claude API, Amazon Bedrock, and Google Cloud Vertex AI, while Microsoft Foundry support is listed at 200,000 tokens. Maximum output is 128,000 tokens. Fast mode runs at roughly 2.5 times standard speed and is now priced at ten dollars per million input tokens and fifty dollars per million output tokens, down sharply from Opus 4.7 fast mode. For teams running high-volume coding or analysis workflows, that price cut may be more meaningful than a single benchmark win.

The honesty upgrade matters more than it sounds

Anthropic is making a narrower and more useful claim than simply saying the model is smarter. It says Opus 4.8 is around four times less likely than Opus 4.7 to let flaws in its own code pass without comment. The system card has also been widely cited for showing the first Claude model to score zero percent on uncritically reporting flawed results, with a more than tenfold reduction in overconfidence versus Opus 4.7.

That matters because autonomous agents fail in ways that are different from chatbots. A chatbot can give a bad answer and be corrected in the next message. An agent can edit files, summarize progress, skip a failed test, and present the whole thing as complete. In that workflow, honesty is not a soft alignment feature. It is operational reliability. A model that flags uncertainty and admits a blocked path is often more valuable than one that sounds polished while hiding the problem.

Anthropic has also been building memory and dreaming features for Claude Managed Agents. Memory lets agents retain what they learn during a session, while dreaming refines that memory between sessions by finding patterns and removing stale information. Together with multiagent orchestration, the goal is clear: make long-running agent deployments improve over time instead of starting cold whenever the next task begins.

The pricing picture is still complicated

Standard API pricing remains five dollars per million input tokens and twenty-five dollars per million output tokens. Batch API users can still receive a fifty percent discount, and prompt caching can reduce repeated input costs. Fast mode is cheaper than before, but it is still a premium option, and some API access is gated through availability controls. In other words, Opus 4.8 is not suddenly a low-cost model. It is a more efficient frontier model for teams that already have work valuable enough to justify the spend.

For teams already using Opus 4.7, the upgrade path looks relatively direct. The model identifier changes, standard pricing stays flat, and the strongest improvements sit in areas where enterprise buyers actually feel pain: coding reliability, long-context work, and agent coordination. The caveat is that Anthropic itself describes Opus 4.8 as a modest but tangible improvement, not a new model class.

The more important signal is release cadence. GPT-5.5 held the public lead for a short window, Opus 4.8 has taken it back on some measures, and Anthropic is already pointing to Mythos-class models in the coming weeks once additional cyber safeguards are ready. For buyers, the lesson is practical. Do not build your AI strategy around whichever model tops a leaderboard this week. Build around evaluation, orchestration, cost control, and the ability to swap models when the frontier moves again.

Also read: Meta AI chatbot exploit allows hackers to hijack high profile Instagram accounts • Apple shifts artificial intelligence narrative to device level before WWDC • Florida sues OpenAI and CEO Sam Altman over deceptive safety claims