Amazon scraps AI leaderboard to stop workers boosting usage scores

Senior executive tells staff 'don't use AI just for the sake of using AI'

Amazon shut down an internal AI usage ranking after employees began running meaningless tasks to inflate their scores, raising the company's own computing costs in the process.

The leaderboard was called KiroRank, and it lived inside Amazon's Kiro developer platform. The idea was straightforward: measure how much staff used AI tools, rank them accordingly, and encourage adoption across the company. Senior vice president Dave Treadwell has since told employees the system had "good intentions." What it also had was a flaw that anyone familiar with Goodhart's Law could have predicted: once a measure becomes a target, it stops being a useful measure.

Workers figured out quickly that the metric rewarded volume. So they found ways to generate volume. Employees began assigning AI agents to carry out unnecessary tasks, running low-value calls repeatedly in an apparent effort to climb the rankings. Amazon's internal term for the behavior was "tokenmaxxing." The result was higher infrastructure costs and a leaderboard filled with activity that looked productive but wasn't. Treadwell pulled the plug and delivered a direct message to staff: "Please don't use AI just for the sake of using AI."

A broader pattern across Big Tech

Amazon is not alone. Meta employees have reportedly engaged in the same behavior, gaming internal AI usage tables by driving up token consumption without generating meaningful output. The pattern points to something deeper about how the tech industry is managing AI adoption internally: the enthusiasm for the technology has produced incentive structures that measure the wrong things. Usage is not the same as value. Token count is not the same as productivity.

The distinction matters more than it might appear. AI infrastructure is expensive. Every token processed consumes compute, which costs money. When employees inflate usage to chase scores rather than solve real problems, companies absorb real costs for zero business benefit. At scale, across an organization the size of Amazon, that adds up fast. The very infrastructure investment these companies are making in AI becomes partially wasted when adoption incentives go unchecked.
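To make that arithmetic concrete, here is a minimal sketch of how padded usage turns into spend. Every figure in it (the headcount, the padded tokens per employee, the price per million tokens) is a made-up placeholder chosen for illustration, not a number reported about Amazon or any other company.

```python
def wasted_spend(employees: int,
                 padded_tokens_per_employee: int,
                 price_per_million_tokens: float) -> float:
    """Cost of tokens consumed only to inflate a usage score."""
    total_tokens = employees * padded_tokens_per_employee
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical inputs: 1,000 employees each padding 50 million tokens
# a month, at an assumed price of $5 per million tokens.
monthly_waste = wasted_spend(1_000, 50_000_000, 5.0)
print(f"${monthly_waste:,.0f} per month")  # $250,000 per month
```

The point is the shape of the calculation rather than the total: the cost scales linearly with both headcount and padding, so a small per-person habit becomes a large line item.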

As the Financial Times recently reported, the episode illustrates how quickly the pressure to demonstrate AI adoption can produce counterproductive behavior. The irony is sharp: companies trying to accelerate genuine integration ended up measuring the exact opposite of it. Superficial engagement designed to satisfy a metric, rather than improve a workflow, is what the leaderboard ultimately rewarded.

From raw counts to meaningful metrics

Amazon's response is instructive. Rather than simply removing the leaderboard and moving on, the company has shifted to a new internal metric it calls "normalised deployments," designed to capture meaningful AI-driven work rather than raw activity volume. The framing matters: Amazon is signaling that the goal is not more AI usage, but better AI usage. A senior executive telling staff not to use AI for its own sake is, in the context of the current industry moment, a notable course correction.

Every major tech company has publicly committed to aggressive internal AI adoption. Microsoft has embedded Copilot across its entire product suite. Google has tied Gemini to its workplace tools. Amazon itself mandated wider use of AI coding tools across engineering teams. The pressure to show measurable results is real, and when KiroRank created a visible, competitive ranking, it aligned perfectly with that mandate. The problem was that the metric had no quality filter. Quantity was all that was being counted.

For business leaders building internal AI programs, the lesson is direct. Adoption metrics that reward volume without assessing output will be gamed. The cost is not only wasted compute. It is the harder-to-quantify cost of employees learning to perform AI use rather than develop it as a real capability. That gap will matter as companies try to determine whether their AI investments are generating actual productivity gains or just activity.
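The difference between counting volume and counting output can be shown with a small sketch. The names and numbers below are invented, and `accepted_changes` is a hypothetical quality signal (work that was reviewed and shipped); it is not a description of how Amazon defines its own metrics.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    name: str
    tokens: int            # raw tokens consumed
    accepted_changes: int  # hypothetical signal: changes reviewed and shipped


def rank_by_volume(rows):
    """Leaderboard that rewards raw token consumption."""
    return [u.name for u in sorted(rows, key=lambda u: u.tokens, reverse=True)]


def rank_by_output(rows):
    """Leaderboard that rewards shipped work, ignoring token volume."""
    return [u.name for u in sorted(rows, key=lambda u: u.accepted_changes, reverse=True)]


# Invented data: one employee pads usage, two use the tools normally.
rows = [
    Usage("padder", tokens=90_000_000, accepted_changes=2),
    Usage("steady", tokens=8_000_000, accepted_changes=14),
    Usage("light", tokens=1_000_000, accepted_changes=5),
]

print(rank_by_volume(rows))  # ['padder', 'steady', 'light']
print(rank_by_output(rows))  # ['steady', 'light', 'padder']
```

The volume ranking puts the padder first; the output ranking puts the padder last. An output count is still a target that can be gamed, for example by splitting one change into many, so it needs its own checks.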

Amazon's pivot to normalised deployments will face the same pressure unless the definition of meaningful deployment is carefully specified and actively monitored. The underlying dynamic, where workers respond to whatever incentive structure sits in front of them, does not disappear when you rename the leaderboard. What changes is whether the metric is harder to manipulate than a raw token count. That is the real test for any enterprise AI adoption program, at Amazon or anywhere else.
