Alibaba's new Qwen models show how far efficiency can stretch

Alibaba's Qwen team has pushed a small but potent MoE model onto the public Terminal-Bench 2.0 leaderboard, and the result is a useful signal for anyone building AI products on a budget.

The headline is not just that Qwen3.6-35B-A3B and Qwen3.5-9B are now on the board. It is that the 35B model, with only 3B active parameters per token, has posted a score on a benchmark that rewards agents that can actually do work in a terminal, not just answer questions well.

That matters for startups because the compute math changes. A model that carries 35 billion total parameters but activates only a slice of them at inference can offer a much better performance-per-compute tradeoff than a dense model of similar apparent size, which is exactly the kind of shift that makes local deployment and lower-cost cloud inference more realistic for founders.

According to the public Terminal-Bench 2.0 leaderboard, the Qwen3.6-35B-A3B submission from Itay Inbar for Qwen scored 24.6% with a listed margin of 3.2, while a second run for the same model scored 23.0%. The Qwen3.5-9B entry landed at 9.2%, roughly 15 points behind, which makes the gap between the two models obvious even before you look at architecture. The leaderboard was updated on May 14, 2026, and the public snapshot is current as of May 16, 2026.

Qwen's own blog describes Qwen3.6-35B-A3B as an open-weight model with 35 billion total parameters and 3 billion active parameters per token, built around a sparse mixture-of-experts design. The company also says the model uses a hybrid architecture with linear attention and standard gated attention layers, which is the engineering trick that lets it stay relatively efficient while still chasing stronger agentic coding performance.

The practical point is simple. Dense models pay the full compute cost of every parameter on every token, while sparse MoE models activate only a subset of experts for each token. That does not make them free, since all 35 billion parameters still have to be held in memory, but it does mean the per-token compute bill can look very different from the headline parameter count. That is why a 35B MoE can be more attractive than a much larger dense model for startups that care about latency, hosting cost, and predictable margins.
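To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch. It rests on assumptions that do not come from Qwen or the leaderboard: the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token, and 16-bit weights at 2 bytes per parameter. The function names are hypothetical, and the numbers ignore attention cost, KV cache, routing overhead, and memory bandwidth, so treat them as rough orders of magnitude rather than measurements.

```python
# Rough inference arithmetic for a sparse MoE model versus a dense model.
# Assumptions (not from the article): ~2 FLOPs per active parameter per token,
# and 16-bit weights (2 bytes per parameter).

def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs needed to process one token."""
    return 2.0 * active_params


def weight_memory_gb(total_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory needed just to hold the weights, in gigabytes."""
    return total_params * bytes_per_param / 1e9


# Parameter counts as described in the article.
MOE_TOTAL_PARAMS = 35e9   # all experts must be resident in memory
MOE_ACTIVE_PARAMS = 3e9   # parameters actually used per token
DENSE_PARAMS = 35e9       # a hypothetical dense model of the same total size

moe_flops = forward_flops_per_token(MOE_ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(DENSE_PARAMS)
compute_ratio = dense_flops / moe_flops

print(f"MoE FLOPs per token:   {moe_flops:.2e}")
print(f"Dense FLOPs per token: {dense_flops:.2e}")
print(f"Dense / MoE compute ratio: {compute_ratio:.1f}x")
print(f"MoE weight memory at 16-bit:   {weight_memory_gb(MOE_TOTAL_PARAMS):.0f} GB")
print(f"Dense weight memory at 16-bit: {weight_memory_gb(DENSE_PARAMS):.0f} GB")
```

Under these assumptions the per-token compute gap is large while the weight memory is identical. In other words, sparsity helps most with compute per token, and far less with how much hardware memory the model needs.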

Why founders should care

Terminal-Bench is not a vanity benchmark. It measures how well models handle real terminal workflows, which is closer to what many early-stage AI products need than a pure chat score. If a model can navigate files, run commands, and iterate inside a shell, it can power coding agents, internal ops tools, DevOps helpers, and product features that need more than text generation.

That is why the Qwen result lands differently from a standard model release. The market has spent the last year treating bigger as the default route to better, but this kind of result reinforces a more commercial reality: efficiency is now part of the moat. If two models are close enough in task quality, the one that is cheaper to run and can be deployed locally is the one a startup can actually use.

The Qwen3.5-9B result also tells a useful story. A 9B model scoring 9.2% is not headline-grabbing in isolation, but it shows that under-10B systems are still in the conversation for narrow workflows, edge deployments, and product tiers where cost matters more than absolute capability. For indie builders, that matters because it expands the menu of viable architectures beyond "rent a huge model from someone else."

Alibaba is not the only lab chasing this path, but its results are one of the clearest signs that the industry is moving toward smaller, sharper, sparsely activated models instead of endlessly scaling dense ones. That shift changes the economics for everyone downstream. It pressures inference pricing, because customers will expect more capability at lower cost. It also changes VC logic, because foundation model companies can no longer rely on raw size alone as proof of defensibility.

For investors, the message is uncomfortable but familiar. If a 35B MoE model can deliver credible agentic performance while activating just 3B parameters at a time, the value may migrate from brute-force frontier training toward distribution, tooling, orchestration, and niche data advantages. The companies most exposed are the ones whose business model depends on charging a premium for access to capabilities that are becoming cheaper to replicate.

For founders, the opportunity is more immediate. Lower inference cost means more room to experiment with agentic products, more freedom to run models privately, and less need to overbuild around enterprise GPU clusters before product-market fit is clear. That does not erase the need for careful engineering, but it does reduce the penalty for trying. In a market where every dollar of compute can decide whether a startup keeps iterating, that is not a minor improvement. It is a structural one.
