Arena hit $100M in annualized revenue by letting AI companies pay to be evaluated, and that is exactly the problem

Arena, the UC Berkeley-born AI evaluation platform behind the widely used Chatbot Arena leaderboard, more than tripled its annualized revenue from $30M to $100M in a matter of months, but its consumption-based business model and the identity of its paying customers raise questions the company hasn't fully answered.

Eight months is a short time to build a $100 million business. Arena, which only launched its commercial service in September 2025, crossed that annualized run-rate milestone this month, according to reporting by TechCrunch. For context: when the company raised its $150 million Series A at a $1.7 billion valuation back in January, ARR stood at $30 million. The trajectory is steep, and in a crowded AI market it's the kind of number that demands attention.

The company grew out of a UC Berkeley research project, co-founded by postdoctoral students Anastasios Angelopoulos and Wei-Lin Chiang alongside Ion Stoica, the Berkeley professor and Databricks co-founder who advised the project before it incorporated in April 2025. What started as a free, crowdsourced leaderboard for ranking AI models has become infrastructure that model labs and enterprises now pay to access. The commercial product, called AI Evaluations, gives those customers deep-dive analytics drawn from Arena's community of users, who have collectively submitted more than 10 million head-to-head model comparisons.

That community-driven leaderboard is the asset. It's human, distributed, and genuinely hard to replicate. When you want to know whether your new model is actually better than the competition, Arena's blind pairwise ratings from real users carry more weight than a score on a curated benchmark. That's what turned evaluation infrastructure from a footnote into its own category in the AI stack, and it's why investors including Andreessen Horowitz, Kleiner Perkins, Felicis, and Lightspeed have collectively put $250 million into the company.

Impressive revenue numbers don't make the underlying tension go away. Arena's paying customers include the same AI companies whose models appear on its leaderboard. OpenAI, Google, Anthropic, and Meta are both customers and contestants, effectively paying the referee while competing on the field. Arena insists its evaluation team operates independently, that model submissions are processed blindly, and that the crowdsourced pool is too large and distributed to manipulate. Those are reasonable claims. They're also claims the company has an obvious incentive to make.

The gaming accusations are more concrete. A paper from researchers at Cohere, Stanford, MIT, and AI2 published in April 2025 alleged that Arena allowed a select group of companies to privately test multiple model variants before a public release, then surface only the highest-scoring version on the leaderboard. Meta reportedly tested 27 variants of what would become Llama 4 between January and March 2025, ultimately publicizing the score of just one, which conveniently ranked near the top. Arena has disputed the characterization, but the episode exposed a structural problem: when private pre-release testing is a feature you offer paying clients, you've introduced a mechanism that, even if used in good faith, can look indistinguishable from paying for favorable placement.
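To see why publishing only the best of many private tests matters, consider a rough simulation. It is a sketch under stated assumptions, not a model of Arena's actual scoring: the rating of 1200, the noise level of 15 points, and the function name `simulate_best_of_n` are all hypothetical. Every variant here is equally good, so any gap between the two results comes purely from picking the maximum of noisy measurements.

```python
import random
import statistics


def simulate_best_of_n(n_variants, true_score=1200.0, noise_sd=15.0,
                       trials=2000, seed=0):
    """Average published score when a lab privately tests n_variants
    equally good models and publishes only the highest measurement.

    All numbers are hypothetical; they are not Arena's figures.
    """
    rng = random.Random(seed)
    published = []
    for _ in range(trials):
        # Each private test is the same true score plus measurement noise.
        measured = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
        # Only the best-looking result reaches the public leaderboard.
        published.append(max(measured))
    return statistics.mean(published)


single = simulate_best_of_n(1)
best_of_27 = simulate_best_of_n(27)
print(f"one variant tested:   {single:.0f}")
print(f"best of 27 published: {best_of_27:.0f}")
```

In this toy setup the best-of-27 average lands well above the single-variant average even though no variant is actually better, which is the selection effect the paper's authors were pointing at.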

Independence isn't just about preventing active corruption. It's about whether the appearance of independence can survive the business model. Right now, that appearance is under strain.

There's a separate financial question worth keeping in mind. Arena's $100M figure is annualized run-rate revenue, not contracted recurring revenue. The company charges on consumption, meaning customers pay for evaluations as they run them, not on a fixed subscription. That's fine as a monetization approach, but it makes the number more volatile than a clean SaaS ARR figure. A model lab running intensive post-training evaluations in one quarter may spend nothing the next. Investors typically discount consumption-based ARR precisely because the churn dynamics are harder to read.
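The arithmetic behind that volatility is simple enough to sketch. The example below assumes the common convention of annualizing a single month's revenue by multiplying by 12; the article's sources do not say how Arena computes its figure, and the two later monthly values are invented purely for illustration.

```python
def annualized_run_rate(monthly_revenue_musd: float) -> float:
    """Annualize one month's revenue (in $M) by multiplying by 12.

    This convention is an assumption, not Arena's disclosed method.
    """
    return monthly_revenue_musd * 12


# Monthly revenue implied by a $100M run rate under this convention.
implied_month = 100 / 12

# Hypothetical months: the implied one, then two slower months of usage.
hypothetical_months = [implied_month, 6.0, 3.0]
run_rates = [annualized_run_rate(m) for m in hypothetical_months]

for month, rate in zip(hypothetical_months, run_rates):
    print(f"monthly ${month:.2f}M -> run rate ${rate:.0f}M")
```

Under a fixed subscription, the annualized figure would stay flat across those three months. Under consumption pricing, it moves with whatever customers happened to run that month.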

None of this means Arena's growth isn't real. Evaluation infrastructure filling a genuine gap in the AI development stack is a convincing enough story on its own. The free Chatbot Arena leaderboard has earned real credibility among researchers and practitioners, and that credibility is the moat the commercial product is built on. Eroding it to chase revenue would be a slow form of self-destruction, and Arena's founders presumably understand the stakes of getting the independence question right.

But getting it right requires more than asserting it. The conflict-of-interest problem won't be solved by information barriers and blind submission processes alone, not when the companies paying you can privately test dozens of model variants before a launch. If Arena wants to hold its position as the most trusted name in AI evaluation, the methodology around private testing needs to be far more transparent, and probably more restricted, than it appears to be today. A leaderboard that shapes perceptions of which AI model is winning the market is too important to leave structurally compromised, even if the numbers on the revenue line look great.
