AI agents are starting to do real research math

Google DeepMind’s AlphaProof Nexus shows that AI agents are moving from benchmark tests into the harder business of producing verifiable scientific work.

The interesting part is not that an AI system solved some math problems. We have seen enough contest scores and leaderboard jumps to know that frontier models can look impressive under controlled conditions. The important part is that AlphaProof Nexus was pointed at open mathematical problems, asked to produce formal proofs, and came back with results that can be checked by software rather than trusted on style.

That changes the conversation. A benchmark tells you whether a model can answer questions with known answers. Open research asks whether it can help create something new. For founders, investors, and research-heavy companies, that is a much more useful test of where AI agents are heading.

According to an arXiv paper published on May 21 by Google DeepMind researchers, AlphaProof Nexus autonomously resolved 9 of 353 open Erdős problems and proved 44 of 492 conjectures from the On-Line Encyclopedia of Integer Sequences (OEIS). The team also said the system is being used in work across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics. The formal proof files have been published in a Google DeepMind GitHub repository, alongside natural-language proof explanations.

Mathematics is a useful place to test AI because the difference between sounding right and being right is brutal. A polished paragraph does not matter if the argument fails. This is where AlphaProof Nexus is different from ordinary chatbot-style reasoning. It works with Lean, a formal proof assistant that checks each logical step. If a proof does not compile, it does not count.
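To make "if a proof does not compile, it does not count" concrete, here is a toy Lean 4 sketch. It is an illustration only, not taken from the AlphaProof Nexus repository, and the theorem names are made up for this example. The two statements are accepted because their proofs type-check; the commented-out line is one Lean would reject.

```lean
-- Accepted: `n + 0` reduces to `n` by the definition of addition on Nat,
-- so reflexivity (`rfl`) is a complete proof.
theorem toy_add_zero (n : Nat) : n + 0 = n := rfl

-- Accepted: the proof reuses a lemma from Lean's core library.
theorem toy_add_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b

-- Rejected if uncommented: the statement is false, so no proof term type-checks.
-- example : 2 + 2 = 5 := rfl
```

However confident the prose around a proof sounds, the checker's verdict is binary: the file either compiles or it does not.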

That does not make the system magic. It solved only about 2.5 percent of the Erdős problems it attempted (9 of 353). Two of the solved problems had been open for 56 years, which is eye-catching, but the hit rate still shows how narrow the capability remains. The point is not that mathematicians have been replaced. The point is that an autonomous system can now search through formal proof space, get rejected by a verifier, adjust, and eventually produce machine-checkable output on real unsolved problems.

This is exactly the kind of loop that matters in commercial AI. The best agents will not simply generate text. They will act, test, fail, correct themselves, and produce artifacts that external systems can verify. In software, that verifier might be a compiler or test suite. In finance, it might be a risk engine. In science, it could be a simulation, lab protocol, or formal proof environment.
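The act, test, fail, correct loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DeepMind's architecture: `run_agent`, `make_bisecting_proposer`, and `make_square_verifier` are hypothetical names, and the "verifier" is a toy that checks whether a candidate integer squares to a target. It stands in for a compiler, test suite, or proof checker.

```python
from typing import Callable, Optional, Tuple

Feedback = Optional[str]


def run_agent(
    propose: Callable[[Feedback], int],
    verify: Callable[[int], Tuple[bool, Feedback]],
    max_attempts: int,
) -> Tuple[Optional[int], int]:
    """Propose, verify, and retry with feedback until accepted or out of budget."""
    feedback: Feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = propose(feedback)
        accepted, feedback = verify(candidate)
        if accepted:
            return candidate, attempt
    return None, max_attempts


def make_bisecting_proposer(low: int, high: int) -> Callable[[Feedback], int]:
    """Toy 'agent': narrows its search range using the verifier's last rejection."""
    state = {"low": low, "high": high, "last": low}

    def propose(feedback: Feedback) -> int:
        if feedback == "too low":
            state["low"] = state["last"] + 1
        elif feedback == "too high":
            state["high"] = state["last"] - 1
        state["last"] = (state["low"] + state["high"]) // 2
        return state["last"]

    return propose


def make_square_verifier(target: int) -> Callable[[int], Tuple[bool, Feedback]]:
    """Toy external verifier: accepts only a candidate whose square is the target."""

    def verify(candidate: int) -> Tuple[bool, Feedback]:
        square = candidate * candidate
        if square == target:
            return True, None
        return False, ("too low" if square < target else "too high")

    return verify


if __name__ == "__main__":
    result, attempts = run_agent(
        make_bisecting_proposer(0, 100), make_square_verifier(1764), max_attempts=20
    )
    print(result, attempts)
```

The design point is that acceptance is decided outside the proposer. The agent can be wrong as often as it likes, provided the verifier is trustworthy and the attempt budget is capped.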

That is why the cost detail matters. DeepMind’s paper says the most capable agent operated at a per-problem inference cost of a few hundred dollars. That is not trivial, but it is low enough to make long-running research agents feel less like a trophy demo and more like an emerging operating model. If a company can spend hundreds or thousands of dollars to explore a technical problem that would otherwise consume weeks of specialist time, the economics start to look very different.

Research tools are becoming infrastructure

There is a business lesson hiding inside the math. The next market for AI is not just chat subscriptions or coding assistants. It is specialized research infrastructure built around agents, verifiers, domain libraries, and compute budgets. DeepMind has Gemini. OpenAI has been pushing reasoning models into scientific discovery. Anthropic has made Claude Code a serious agentic work tool. The common thread: models become more valuable when they are connected to environments that can judge their work.

This also raises a pricing question. If research agents can run for hours or days, the old per-seat software model starts to look incomplete. Customers will care less about access and more about outcomes, compute limits, auditability, and review workflows. A startup using agents for chip design, drug discovery, materials science, or legal research will want to know what a run costs, how errors are caught, who signs off, and whether results can be reproduced.

That creates room for new companies. Some will build agent orchestration layers for research teams. Some will build verification tools in specific domains. Others will package formal methods, simulation engines, and curated datasets so that frontier models can do useful work without wandering through unreliable context. The winner may not always be the lab with the largest model. It may be the company that gives the model the best environment to work inside.

There is still a hard human problem here. Mathematical results need interpretation, taste, and judgment. A proof can be correct and still not be important. A system can rediscover prior work, miss context, or produce results that are formally valid but strategically uninteresting. The human role shifts toward choosing the right problems, deciding which outputs deserve attention, and turning a proof or discovery into something useful.

The practical takeaway is simple. Any business built around technical knowledge should start watching autonomous research agents now, even if the immediate use case is not mathematics. AlphaProof Nexus is still early, narrow, and dependent on formal verification. But it points to a market where AI systems do not just summarize the frontier. They help push it forward, one checked result at a time.
