Gemini 3.5 Flash takes the lead in Zapier's AutomationBench

Google's latest Flash model has moved from fast and cheap to front-line agentic muscle.

Gemini 3.5 Flash has taken the top spot on Zapier's AutomationBench, a benchmark built to test whether models can actually complete real business workflows across sales, marketing, operations, support, finance, and HR. That matters because Zapier's scorecard is not about polished chatbot answers. It is about whether an AI model can do the work inside live automation pipelines.

The result lands at a useful moment for Google. At I/O on May 19, the company presented Gemini 3.5 Flash as delivering "frontier performance for agents and coding," and said the model is now the default in the Gemini app and AI Mode in Search, with enterprise access through Gemini Enterprise and the Gemini API. In other words, according to Google's launch post and model card, the company is no longer selling Flash as a trimmed-down speed model, but as an agentic workhorse built for multi-step tasks and longer workflows.

Zapier's AutomationBench was designed to measure end-to-end workflow execution with real tools and deterministic grading, which gives it more practical weight than many benchmark leaderboards. The benchmark uses a public/private split and scores the final environment state against fixed success criteria, which is exactly the kind of test enterprise buyers and startup builders care about when they are wiring models into automation stacks. Zapier said the benchmark reflects use cases drawn from the 3.7 million companies and 2 billion monthly tasks it sees across its platform.
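To make "deterministic grading against the final environment state" concrete, here is a minimal Python sketch of the general idea. It is not Zapier's implementation: the state layout, the criterion names, and the `grade` and `passed` helpers are all hypothetical, chosen only to illustrate how a run can be scored by fixed checks on what the environment looks like afterward rather than by judging the model's text.

```python
from typing import Any, Callable

# Hypothetical representation of a simulated environment after an agent run.
State = dict[str, Any]
Check = Callable[[State], bool]


def grade(final_state: State, criteria: dict[str, Check]) -> dict[str, bool]:
    """Apply every fixed criterion to the final state. No model-based judging."""
    return {name: bool(check(final_state)) for name, check in criteria.items()}


def passed(results: dict[str, bool]) -> bool:
    """A task counts as solved only if every criterion holds."""
    return all(results.values())


# Illustrative success criteria for an imagined lead-routing task.
criteria: dict[str, Check] = {
    "lead_created": lambda s: any(
        lead["email"] == "ada@example.com" for lead in s.get("crm_leads", [])
    ),
    "owner_notified": lambda s: any(
        "ada@example.com" in msg for msg in s.get("slack_messages", [])
    ),
    "no_duplicate_leads": lambda s: len(
        {lead["email"] for lead in s.get("crm_leads", [])}
    ) == len(s.get("crm_leads", [])),
}

# Illustrative final state, as if recorded after the agent finished.
final_state: State = {
    "crm_leads": [{"email": "ada@example.com", "owner": "sales-emea"}],
    "slack_messages": ["New lead assigned: ada@example.com"],
}

results = grade(final_state, criteria)
print(results, "PASS" if passed(results) else "FAIL")
```

Because the checks are plain functions of the final state, the same run always receives the same score, which is the property that separates this style of grading from subjective answer-quality ratings.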

That makes Gemini 3.5 Flash's new result more than a branding win. If a model can reliably move through a messy workflow, handle branching logic, and keep state across multiple steps, it becomes easier to imagine it being used for onboarding, routing, finance ops, or support triage rather than just drafting text. Zapier's own benchmark page says AutomationBench is meant to test real workflow orchestration, not vibes, and its leaderboard has become a shortcut for seeing which models are most usable in automation-heavy environments.

There is also a competitive layer here. Zapier's public leaderboard had previously shown OpenAI and Anthropic models near the top of its rankings, which is why Google moving into the number one slot is notable. Even before this result, the same benchmark showed a tightly packed field, with small differences in score carrying outsized meaning for builders deciding which model sits behind a customer-facing workflow or internal agent.

Google's agentic push

Google has been making the same argument all week: the era of judging models only by speed or raw chat quality is fading, and agentic execution is the more important test. Its May 19 announcement said Gemini 3.5 Flash is built for complex, long-horizon tasks and can help with software development, financial document preparation, customer onboarding, OCR, tax workflows, and data diagnostics. The company also claimed it runs at four times the output-token speed of other frontier models, while retaining the kind of capability usually associated with bigger flagship systems.

That framing helps explain why a benchmark like AutomationBench is strategically useful for Google. The model card says Gemini 3.5 Flash is suited to agentic workflows, coding tasks, and multi-week enterprise processes, and Google highlighted early partner use cases from Shopify, Macquarie Bank, Salesforce, Ramp, Xero, and Databricks. The message is clear enough. Flash is being positioned as the model you hand the tedious work to when the task is not a single prompt but a chain of decisions, retrievals, and tool calls.

For Google, winning a workflow benchmark also supports the broader product story around Gemini being embedded across Search, mobile, developer tools, and enterprise systems. That is important because enterprise AI buying is increasingly shifting away from novelty demos and toward systems that save time inside existing infrastructure. A model that scores well on automation can become the quiet layer sitting underneath dozens of business processes, which is exactly where vendors want to be.

Pressure on OpenAI and Anthropic

The bigger market implication is that Google is pressing directly on OpenAI and Anthropic in the segment that may matter most for the next phase of AI adoption. Consumer chat still gets attention, but enterprise workflows are where budgets, retention, and long-term developer loyalty begin to harden. If Gemini 3.5 Flash can keep showing strength on agentic benchmarks, the competitive conversation shifts from who writes the best answer to who completes the job with the least friction.

That is where Zapier's benchmark is especially uncomfortable for rivals. The point of automation software is not to impress on a single prompt, but to handle repetitive work without breaking when conditions change. A model that can hold state, use tools, and finish workflows cleanly has a better chance of being pulled into production by builders who are tired of testing models that look good in a demo and wobble in deployment.

There is still room for caution. Benchmarks are helpful, but they are not the same as the chaos of production systems, where permissions, edge cases, and brittle third-party APIs can matter more than leaderboard placement. Even so, the combination of Google's new positioning and a strong AutomationBench result gives Flash a sharper identity than it had a week ago. It now looks less like a cheap variant and more like a direct contender for the agent layer of enterprise AI.
