Microsoft is making AI behavior testing easier for developers

Microsoft's new ASSERT framework turns plain-language AI policies into executable tests, giving developers a practical way to catch agent failures before they become production problems.

Microsoft used Build 2026 to put a sharper tool in front of developers who are tired of guessing whether their AI systems will behave properly once users, tools and company policies get involved. ASSERT, short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing, is an open-source framework from Microsoft's Responsible AI team that converts written behavior requirements into evaluations that can be run against models, applications and agents.

That may sound technical, but the problem is simple. A startup can test whether a chatbot answers a benchmark question correctly and still have no idea whether its sales agent will leak customer data, email the wrong person, ignore an internal policy or misuse a tool in a real workflow. Generic AI benchmarks are useful, but they are not built around your company, your customers or your risk tolerance.

ASSERT is meant to close that gap. Developers write the behavior they expect in natural language, including goals, rules, constraints and policies. The framework then turns those instructions into structured expectations, generates scenarios and test cases, runs them against the target system and produces scored results that can be inspected. As TechCrunch reported on June 2, ASSERT can also record intermediate actions and tool calls, which matters because agent failures often happen before the final answer ever appears on screen.
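To make that flow concrete, here is a minimal Python sketch of the general pattern: a written rule, a check derived from it, a couple of scenarios, and a score. It is not ASSERT's API. Every name in it (`Expectation`, `Trace`, `toy_agent`, `evaluate`) is hypothetical, the check is written by hand rather than generated from the policy text, and the agent is a stand-in. The point is that the second scenario fails on a recorded tool call even though the final answer looks fine.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Trace:
    """Everything the agent did, not just what it said at the end."""
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_answer: str = ""


@dataclass
class Expectation:
    rule: str                       # the plain-language policy
    check: Callable[[Trace], bool]  # hand-written here (assumption); a real tool would derive it


def toy_agent(prompt: str) -> Trace:
    """Hypothetical stand-in agent: emails whoever the prompt names, with no policy check."""
    recipient = prompt.split("to ")[-1].strip(". ")
    trace = Trace()
    trace.tool_calls.append(ToolCall("send_email", {"to": recipient}))
    trace.final_answer = "Done."
    return trace


def only_internal_email(trace: Trace) -> bool:
    """Pass only if every send_email call targets an example.com address."""
    return all(
        call.args["to"].endswith("@example.com")
        for call in trace.tool_calls
        if call.name == "send_email"
    )


expectations = [Expectation("Only email addresses at example.com.", only_internal_email)]
scenarios = [
    "Send the report to ana@example.com.",
    "Send the report to eve@attacker.test.",
]


def evaluate(agent, expectations, scenarios):
    """Run every scenario against every expectation and keep the evidence."""
    results = []
    for scenario in scenarios:
        trace = agent(scenario)
        for exp in expectations:
            results.append({
                "scenario": scenario,
                "rule": exp.rule,
                "passed": exp.check(trace),
                "tool_calls": [c.name for c in trace.tool_calls],
                "final_answer": trace.final_answer,
            })
    return results


results = evaluate(toy_agent, expectations, scenarios)
score = sum(r["passed"] for r in results) / len(results)
```

Both runs end with the same harmless-looking "Done.", so a check on the final answer alone would miss the violation. Only the recorded `send_email` call exposes it.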

The timing is important. AI startups are moving from demos to production systems, and production is where vague confidence starts to break down. A model that looks impressive in a controlled demo can behave differently when it has access to email, databases, calendars, code repositories or payment tools. The failure mode is no longer just a bad answer. It can become a compliance issue, a security incident or a very expensive customer support problem.

Microsoft's own Build materials describe ASSERT as a framework for policy-driven agent evaluation, not just chatbot quality checking. That distinction matters. A support bot may need to follow refund rules. A health workflow may need to escalate sensitive cases. A finance agent may need to refuse certain actions unless a person approves them. These are not abstract safety questions. They are operational rules, and they need tests that match the business.

This is where spec-driven evaluation becomes interesting for enterprise AI procurement. Buyers are no longer asking only whether a model is powerful. They are asking whether a vendor can prove that its system follows written requirements over time. For startups selling into large companies, that proof may soon become part of the sales process. A clean demo will not be enough if the security, legal and compliance teams cannot see how the system is evaluated.

ASSERT also points to a more practical way to handle regression testing. When a team changes prompts, models, tools or workflows, the system can be tested again against the same behavior requirements. That gives developers a clearer view of whether a change improved the product or quietly broke a rule that mattered. In fast-moving AI teams, that kind of repeatability is not a luxury. It is how you keep shipping without flying blind.
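The regression idea can be sketched in a few lines of Python. This is an illustration of the concept, not ASSERT's output format: the rule names and the pass/fail dictionaries are made up, and `find_regressions` and `find_fixes` are hypothetical helpers. It compares results for the same requirements before and after a change.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Rules that passed before the change and no longer pass (or went missing)."""
    return sorted(
        rule for rule, passed in baseline.items()
        if passed and not candidate.get(rule, False)
    )


def find_fixes(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Rules that failed before the change and pass now."""
    return sorted(
        rule for rule, passed in candidate.items()
        if passed and not baseline.get(rule, False)
    )


# Hypothetical results for the same three requirements, before and after a prompt change.
baseline = {"refund_limit": True, "no_pii_in_reply": True, "escalate_sensitive": False}
candidate = {"refund_limit": True, "no_pii_in_reply": False, "escalate_sensitive": True}

regressions = find_regressions(baseline, candidate)
fixes = find_fixes(baseline, candidate)
```

In this made-up run the change fixed one rule and quietly broke another, which is the kind of trade a team would want to see before shipping.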

Microsoft wants the trust layer

There is a bigger platform strategy underneath the release. Microsoft is not presenting ASSERT as a Copilot-only feature. Its Foundry blog says the framework is open source and works across LangChain, CrewAI, LiteLLM, OpenAI and other stacks. The company also says it is not tied to Microsoft Foundry, which is a direct signal to developers who do not want evaluation tooling locked inside one cloud platform.

At the same time, Microsoft is clearly positioning itself as the trust layer for agentic software. Build 2026 also brought the Agent Control Specification, an open standard for applying deterministic controls at five points in an agent lifecycle: input, model, state, tool execution and output. The two ideas fit together neatly. ASSERT helps developers find where an agent is failing. The Agent Control Specification gives them a portable way to apply controls where those failures occur.
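One way to picture deterministic controls at those five points is a small registry of allow-or-deny functions keyed by lifecycle stage. The Python sketch below is only an illustration under that assumption. It does not follow the Agent Control Specification's actual schema, and `Stage`, `ControlRegistry` and the two example controls are hypothetical.

```python
from enum import Enum
from typing import Callable


class Stage(Enum):
    """The five lifecycle points named in the article."""
    INPUT = "input"
    MODEL = "model"
    STATE = "state"
    TOOL_EXECUTION = "tool_execution"
    OUTPUT = "output"


# A control is a plain function: same payload in, same decision out.
Control = Callable[[dict], bool]


class ControlRegistry:
    """Hypothetical registry; not the specification's real interface."""

    def __init__(self) -> None:
        self._controls: dict[Stage, list[Control]] = {stage: [] for stage in Stage}

    def register(self, stage: Stage, control: Control) -> None:
        self._controls[stage].append(control)

    def allowed(self, stage: Stage, payload: dict) -> bool:
        # Every control registered for the stage must allow the payload.
        return all(control(payload) for control in self._controls[stage])


registry = ControlRegistry()

# Tool execution: payments require explicit human approval.
registry.register(
    Stage.TOOL_EXECUTION,
    lambda p: p.get("tool") != "issue_payment" or p.get("human_approved", False),
)

# Output: block replies that mention an SSN (a deliberately crude check).
registry.register(Stage.OUTPUT, lambda p: "ssn" not in p.get("text", "").lower())

blocked = registry.allowed(Stage.TOOL_EXECUTION, {"tool": "issue_payment"})
approved = registry.allowed(
    Stage.TOOL_EXECUTION, {"tool": "issue_payment", "human_approved": True}
)
```

Because the controls are ordinary functions rather than model judgments, the same payload always gets the same decision, which is what "deterministic" implies here.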

That combination is aimed squarely at the next phase of AI adoption. The market has enough tools for building agents. What it lacks is confidence that those agents can be observed, tested and governed once they start acting on behalf of users. Microsoft appears to be betting that the company that helps enterprises trust agents may have as much influence as the company that helps them build agents.

For AI startups, the lesson is direct. The bar is rising from model performance to behavior accountability. Investors, customers and enterprise buyers will want to know not only what an AI product can do, but how the team proves it will keep doing the right thing under pressure. ASSERT is one more sign that evaluations are moving from research labs into normal software development.

The practical takeaway is simple. Any startup building agents should start treating written policies as testable software assets. The companies that can turn requirements into repeatable checks will have an easier time winning enterprise trust. The ones that rely on informal prompt reviews and one-off demos may find that the market has moved on.
