AI is making the open web more expensive to remember

AI is turning web preservation into a cost problem. The companies building the future of search and software are also putting pressure on the archives those systems quietly depend on.

The internet has always looked cheaper than it really is. A page disappears, a company pivots, a public statement gets edited, and most people assume the Wayback Machine or a Wikimedia bot will catch the change before it vanishes. That assumption is now under strain, not because archivists have lost interest, but because the economics around them have changed.

According to a recent report from 404 Media, the AI data center boom is pushing up hard-drive and storage costs for the Internet Archive, Wikimedia projects, academics and smaller archival efforts. Drives that once looked like boring infrastructure are now part of the same supply chain fight as GPUs, power contracts and data-center leases. That matters because preservation is not an abstract cultural project. It is one of the basic services that makes the web usable for journalists, founders, researchers, lawyers and anyone trying to understand what was said before it was cleaned up.

For startups, this is a sharper story than it may first appear. AI infrastructure costs are usually discussed through chips, model training budgets and cloud bills. Storage feels less dramatic. But every company building research tools, legal automation, search products, data pipelines or AI agents depends on a web that can be referenced, compared and audited. If the open record gets thinner, the cost of trust rises for everyone downstream.

The AI boom has created a simple imbalance. Model builders and hyperscalers need enormous amounts of storage for training data, synthetic data, logs, checkpoints and retrieval systems. Archival organizations need storage too, but they do not buy like Microsoft, Google, Amazon or Meta. They are often nonprofits, university projects or volunteer communities trying to preserve public knowledge with tight budgets and long timelines.

That means the same market pressure that helps a cloud company secure more capacity can make a public-interest archive wait longer or pay more. A higher drive price is not just an accounting inconvenience. It can mean fewer snapshots, slower expansion, delayed backups or more painful choices about what gets saved. The web is growing, media files are getting larger, and preservation work has to keep pace with both.

There is an uncomfortable symmetry here. AI companies benefit from decades of open-web accumulation, including archives, forums, encyclopedias, code repositories, news pages and public documents. Yet the infrastructure that made that material discoverable and historically useful is being squeezed by AI-driven demand. The question is not whether every model company owes a check to every archive. The more practical question is whether the industry should treat preservation as shared infrastructure, much like open source security or internet routing.

A serious contribution would not have to look like charity. AI firms could fund storage grants, mirror public datasets under archive-friendly terms, support nonprofit crawling infrastructure, or pay for access channels that reduce load while preserving research value. The point is to move beyond extracting from the web and then leaving public institutions to absorb the bill. If open knowledge is part of the input stack, maintaining it should become part of the operating cost.

Anti-scraping rules are creating collateral damage

The second pressure is access. Publishers and platforms have good reasons to push back against aggressive scraping. Some AI crawlers have ignored norms, strained servers and copied valuable work without clear permission or payment. No serious publisher wants to subsidize a future competitor by letting every bot take everything for free.

But blunt blocking can hit the wrong target. The Internet Archive's crawlers, Wikimedia maintenance bots and academic preservation tools can be swept into the same category as commercial model-training systems. Euronews recently reported that hundreds of news organizations have moved to restrict Internet Archive crawlers, with concerns that archived content could be used as a backdoor for AI training. The result is a familiar internet problem: a rule designed for the worst actors ends up constraining the legitimate ones too.

This is where the debate's mortgage-backed securities (MBS) analogy lands. Different risks get packaged together, the clean collateral and the junk collateral sit in the same bundle, and the market pretends the distinction can be sorted out later. In web terms, a preservation bot, a search crawler, a research scraper and a frontier-model training crawler may all look like automated traffic. Their purposes are not the same. Treating them as identical makes the web easier to defend in the short run and easier to forget in the long run.

Better controls are possible. Sites can distinguish between archival access, commercial training, search indexing and bulk extraction. Archives can keep improving rate limits, opt-out systems and safeguards against mass downloading. AI companies can stop hiding behind ambiguous bot identities and make permission, provenance and payment cleaner. None of this is technically effortless, but the alternative is a web where preservation becomes a casualty of the AI arms race.
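To make the idea of purpose-based controls concrete, here is a minimal Python sketch of how a site might sort automated traffic by declared purpose instead of blocking it wholesale. Everything in it is an assumption for illustration: the user-agent tokens are examples that should be checked against each operator's own documentation, and the per-purpose rate limits are hypothetical numbers, not a recommendation from any publisher or archive.

```python
from enum import Enum


class Purpose(Enum):
    ARCHIVAL = "archival"
    SEARCH = "search"
    AI_TRAINING = "ai_training"
    UNKNOWN = "unknown"


# Illustrative mapping from user-agent substrings to a declared purpose.
# These tokens are assumptions for this sketch; verify them against each
# operator's published crawler documentation before relying on them.
KNOWN_BOTS = {
    "archive.org_bot": Purpose.ARCHIVAL,
    "googlebot": Purpose.SEARCH,
    "bingbot": Purpose.SEARCH,
    "gptbot": Purpose.AI_TRAINING,
}

# Hypothetical site policy: requests per minute allowed for each purpose.
# A value of 0 means the purpose is blocked outright.
POLICY = {
    Purpose.ARCHIVAL: 60,
    Purpose.SEARCH: 120,
    Purpose.AI_TRAINING: 0,
    Purpose.UNKNOWN: 10,
}


def classify(user_agent: str) -> Purpose:
    """Return the declared purpose for a user-agent string, if recognized."""
    ua = user_agent.lower()
    for token, purpose in KNOWN_BOTS.items():
        if token in ua:
            return purpose
    return Purpose.UNKNOWN


def allowed_rate(user_agent: str) -> int:
    """Look up the per-minute request budget for this user agent."""
    return POLICY[classify(user_agent)]
```

A user-agent string is only a self-declared label and is easy to spoof, so a sketch like this would need to be paired with some form of identity verification before it could be trusted. That dependency is part of why clear, honest bot identities matter to the argument above.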

The market implication is straightforward. As AI raises the price of storage and the legal temperature around scraping, reliable historical data will become more valuable. Startups that depend on public web data should not assume yesterday's access patterns will hold. They need provenance, licensing, caching strategy and archival partnerships built into the product plan from the start.

The open web was never free. It was subsidized by universities, nonprofits, volunteers, publishers, standards bodies and a long culture of mutual tolerance. AI is testing that bargain. What comes next will show whether the industry can build on public knowledge without making it harder for the public to remember where that knowledge came from.
