Jun 3, 2026 · 11:49 PM

The Hard Drive Shortage Is Real and AI Is the Reason Storage Is Becoming the Infrastructure Constraint Nobody Was Watching


Judith Murphy

The hard drive market is tightening. AI training datasets, inference log retention, synthetic data pipelines, video generation output, and enterprise compliance storage requirements are all growing at the same time, and the combined demand is beginning to create lead time extensions and price pressure in the high-capacity drive categories that data centers, research institutions, archival projects, and smaller AI labs all compete for. That makes storage the infrastructure constraint that follows compute in the AI buildout. It has attracted a fraction of the attention, despite its direct effect on AI economics and on the organisations that cannot afford to outbid hyperscalers for drive capacity.

The demand side of the storage market has changed structurally in a way that the supply side has not yet matched. Hard drive manufacturing is a slow-moving industry dominated by three companies, Seagate, Western Digital, and Toshiba, that have historically managed capacity additions conservatively because the consumer PC and on-premises server markets that traditionally drove demand were declining or flat. The AI buildout has created a new demand profile that these manufacturers are now ramping to address but cannot satisfy immediately: hyperscalers purchasing high-capacity nearline drives in the 20 to 30 terabyte range to build out the cold storage tiers where training datasets, model checkpoints, inference logs, and synthetic data outputs are kept. The tight category is specifically the high-capacity 3.5-inch nearline hard disk drive used for bulk data storage rather than the solid-state drives used for active compute workloads. Seagate's fiscal year 2025 revenue exceeded analyst expectations by a significant margin, and the company has announced capacity expansion investments, but hard drive manufacturing involves long lead times for component procurement and factory buildout that mean supply additions lag demand signals by 12 to 24 months. That gap is where the current shortage lives.

The AI storage demand categories are worth enumerating because each one is substantial on its own and their simultaneous growth compounds the pressure. Training a large language model on a 15-trillion-token dataset requires storing that dataset in a form that can be accessed repeatedly across training runs. At roughly four bytes per token, the tokenized text alone is on the order of tens of terabytes, and the footprint per major training effort can reach petabytes once the raw source data, intermediate filtering stages, and replicas are counted. Inference logging, which enterprises and AI companies implement to monitor model behavior, detect drift, investigate user complaints, and satisfy regulatory requirements around AI system auditing, generates persistent data at rates proportional to query volume. A company handling a million API calls per day at 1,000 tokens per call logs about a billion tokens daily. The raw text of those tokens is only a few gigabytes, so an estimate of roughly 500 gigabytes of log data daily before compression assumes records of about 500 kilobytes per call, with full request context, traces, and metadata stored alongside the text. At that rate the logs accumulate to roughly 180 terabytes per year, and to petabytes per year for deployments running at many times that volume. Video generation models such as OpenAI's Sora and those from Runway produce output files orders of magnitude larger than text, and the synthetic data pipelines that use video models to generate training data for robotics, autonomous vehicles, and physical AI systems require storing both the generation inputs and the outputs at scale. Each of these categories is growing faster than the general data center storage market was growing before the AI buildout began, and they are all competing for the same high-capacity nearline drive capacity.
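The log-volume arithmetic in the paragraph above can be checked with a short back-of-envelope sketch. The four-bytes-per-token ratio is an assumption (a common rule of thumb for English text, not a figure from this article), and the function names are hypothetical. The sketch computes the raw text volume for a million 1,000-token calls per day, the per-call record size implied by a 500 gigabyte daily log volume, and what that daily volume adds up to over a year.

```python
GB = 10**9   # decimal gigabyte, as drive vendors count it
TB = 10**12  # decimal terabyte

# Assumption: about 4 bytes of text per token (rule of thumb, not from the article).
ASSUMED_BYTES_PER_TOKEN = 4


def daily_log_bytes(calls_per_day: int, bytes_per_call: float) -> float:
    """Bytes of log data written per day at a fixed record size per call."""
    return calls_per_day * bytes_per_call


def implied_bytes_per_call(daily_bytes: float, calls_per_day: int) -> float:
    """Record size per call that a given daily log volume implies."""
    return daily_bytes / calls_per_day


calls_per_day = 1_000_000
tokens_per_call = 1_000

# Raw token text only, with no metadata, traces, or request context.
raw_text_bytes = daily_log_bytes(
    calls_per_day, tokens_per_call * ASSUMED_BYTES_PER_TOKEN
)

# What a 500 GB/day log volume implies about the size of each logged record.
record_bytes = implied_bytes_per_call(500 * GB, calls_per_day)

# A constant 500 GB/day accumulated over a year, expressed in terabytes.
yearly_tb = 500 * GB * 365 / TB

print(f"raw text per day: {raw_text_bytes / GB:.1f} GB")
print(f"implied record size: {record_bytes / 1000:.0f} KB per call")
print(f"yearly accumulation at 500 GB/day: {yearly_tb:.1f} TB")
```

The gap between the raw text volume and the daily log figure is the useful part of the exercise: what gets logged alongside each call, not the token count, determines how fast the storage footprint grows.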

The organisations being crowded out are the ones that cannot compete on price or procurement volume with hyperscaler purchasing power. The Internet Archive, which crawls and preserves the public web as a historical record, operates on a budget of approximately $30 million annually and has spoken publicly about the difficulty of sourcing affordable high-capacity storage as drive prices and lead times have increased. University research computing centers that maintain large scientific datasets, genomic sequence databases, climate model outputs, and publicly funded research data repositories are facing procurement timelines that limit their ability to take on new storage commitments. Smaller AI labs and research groups that train models on large datasets but lack the negotiating power of a hyperscaler are paying spot market prices for drive capacity, and those costs represent a meaningful fraction of their operating budgets. Common Crawl, the nonprofit that provides the web crawl dataset used as a training data source by most major language model developers, requires continuous storage additions to maintain and expand its dataset. The irony that the hyperscalers consuming Common Crawl's data for training are simultaneously making it more expensive for Common Crawl to operate is not lost on anyone paying attention to the circular economics of open web preservation in the AI era.

The internet preservation dimension is the one with the longest-term consequences and the least immediate visibility in standard AI infrastructure discussions. The open web that AI models were trained on was preserved by a combination of Common Crawl's automated crawling, the Internet Archive's Wayback Machine, academic web archives, and the indexed pages that search engines cached. These preservation systems were built during a period when storage was becoming cheaper every year, following the cost curves that drove consumer electronics and consumer cloud storage pricing down consistently from the 1990s through the 2010s. The reversal of that cost trend at the high-capacity end of the drive market, even if temporary, interrupts the funding model for organisations that budgeted for storage cost stability or continued declines. A web page that is not crawled and archived in 2026 because an archive cannot afford the storage expansion required to maintain its crawl rate is not recoverable in 2030 when storage becomes cheaper again. The loss of web content to link rot, site shutdowns, and content deletion is a permanent loss, and the organisations that prevent that loss are currently operating in a market environment that was not anticipated when their funding models were established.

For AI startups, the storage constraint translates into a margin planning problem that is not yet showing up prominently in standard AI cost discussions but will become more visible as inference volumes scale. The standard AI cost model focuses on compute, specifically GPU costs per token for training and inference, because compute has historically dominated AI operational budgets and because GPU pricing is highly visible through cloud provider published rates. Storage costs are lower per unit but accumulate persistently rather than being consumed per query, and they apply to categories of data that companies often retain longer than necessary because deletion decisions require explicit review rather than passive expiry. An AI startup that implements no data retention policy, retaining all inference logs, model checkpoints, fine-tuning datasets, and synthetic data outputs indefinitely, will find its storage bill compounding at a rate that becomes material relative to its compute costs as the company scales from thousands to millions of users. The startups that treat storage architecture as a product design decision, building explicit retention policies, tiered storage that moves cold data to lower-cost archive tiers, and compression and deduplication into their data pipelines from early stages, will have structurally lower infrastructure costs at scale than those that treat storage as an unlimited resource and address the cost only when it becomes a budget line item that management notices.
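A minimal sketch of how the retention and tiering choices described above change a storage bill. Every number here is hypothetical: the 500 GB daily ingest, the per-gigabyte monthly prices, and the 90-day and 30-day windows are illustrative assumptions, not vendor rates or figures from any specific company. The model also assumes constant daily ingest and ignores compression, deduplication, and retrieval fees.

```python
from typing import Optional


def stored_gb(daily_ingest_gb: float, day: int,
              retention_days: Optional[int] = None) -> float:
    """Data held at the end of `day` under constant daily ingest.

    With no retention policy everything is kept; with one, only the most
    recent `retention_days` of data remain.
    """
    kept_days = day if retention_days is None else min(day, retention_days)
    return daily_ingest_gb * kept_days


def monthly_cost(gb: float, price_per_gb_month: float) -> float:
    """Monthly bill for holding `gb` at a flat per-GB monthly price."""
    return gb * price_per_gb_month


def tiered_monthly_cost(daily_ingest_gb: float, day: int, hot_days: int,
                        hot_price: float, cold_price: float) -> float:
    """Keep everything, but move data older than `hot_days` to a cheaper tier."""
    hot = stored_gb(daily_ingest_gb, day, retention_days=hot_days)
    cold = stored_gb(daily_ingest_gb, day) - hot
    return monthly_cost(hot, hot_price) + monthly_cost(cold, cold_price)


# Hypothetical inputs for illustration only.
INGEST_GB_PER_DAY = 500
HOT_PRICE = 0.02    # assumed $/GB-month for the active tier
COLD_PRICE = 0.004  # assumed $/GB-month for the archive tier
DAY = 365

keep_everything = monthly_cost(stored_gb(INGEST_GB_PER_DAY, DAY), HOT_PRICE)
ninety_day_policy = monthly_cost(
    stored_gb(INGEST_GB_PER_DAY, DAY, retention_days=90), HOT_PRICE
)
tiered = tiered_monthly_cost(
    INGEST_GB_PER_DAY, DAY, hot_days=30,
    hot_price=HOT_PRICE, cold_price=COLD_PRICE,
)

print(f"keep everything, single tier: ${keep_everything:,.0f}/month")
print(f"90-day retention, single tier: ${ninety_day_policy:,.0f}/month")
print(f"keep everything, 30-day hot tier: ${tiered:,.0f}/month")
```

Under these assumptions the keep-everything bill grows linearly with every day of operation, while the retention policy caps it once the window fills. Tiering keeps the growth but lowers its slope, which is why the two measures are usually combined rather than treated as alternatives.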

Judith Murphy is a financial journalist and market analyst covering AI, technology stocks, and emerging market trends. She has contributed to multiple financial publications and brings a data-driven approach to her coverage of the technology sector and its impact on global markets.