Jun 18, 2026 · 11:08 PM

AI models are hitting a data quality wall and the open web is the reason why

Fortune's reporting on the deteriorating quality of public web data used to train AI models has surfaced a structural problem the industry has been slow to confront: the open web is increasingly composed of synthetic, recycled, and SEO-optimized content that degrades model performance when consumed at scale. Researchers have documented model collapse effects from AI-generated training contamination, and the problem compounds across successive training generations.

Janet Harrison

The training substrate that built the current generation of AI models is deteriorating, and Fortune's reporting on the data quality crisis facing labs and startups alike makes clear that more data is no longer a reliable answer to better performance.

For most of the past five years, the working assumption in AI development was straightforward: larger datasets and bigger crawls produce better models. That assumption is now breaking down in ways that researchers have been warning about quietly for some time and that Fortune's recent reporting has brought into sharper focus. The open web, which served as the primary training corpus for nearly every major language model, is increasingly composed of low-value, recycled, synthetic, and SEO-optimized material that degrades rather than improves model quality when consumed at scale. The problem is not just that there is more noise. It is that the signal-to-noise ratio is deteriorating structurally, and the mechanisms driving that deterioration are self-reinforcing.

The dynamic researchers refer to as model collapse is the most concerning version of this problem. When AI-generated text becomes a significant portion of the training data for subsequent AI models, the output distribution of those models narrows. Rare but important information gets underrepresented across successive training cycles. The model learns the statistical average of AI output rather than the full distribution of human knowledge and expression, and quality degrades in ways that are difficult to detect on standard benchmarks but visible in edge cases and specialized queries. A model trained partly on outputs from earlier models is, in a meaningful sense, learning from its own reflection rather than from the world.
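The narrowing described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the method used in any of the research the article refers to: the "model" is just an empirical frequency table, the starting distribution is an assumed Zipf-like curve, and the function name and parameters are hypothetical. Each generation samples a finite corpus from the previous generation's output and re-estimates frequencies from that sample alone.

```python
import random
from collections import Counter


def simulate_tail_loss(vocab_size=200, sample_size=500, generations=10, seed=0):
    """Return how many distinct items survive after each generation.

    Assumption: each generation is "trained" only on a finite sample
    drawn from the previous generation's output distribution.
    """
    rng = random.Random(seed)
    # Assumed Zipf-like starting point: a few common items, a long rare tail.
    items = list(range(vocab_size))
    weights = [1.0 / (rank + 1) for rank in items]

    surviving = []
    for _ in range(generations):
        sample = rng.choices(items, weights=weights, k=sample_size)
        counts = Counter(sample)
        # The next generation only knows about items that were actually sampled.
        items = sorted(counts)
        weights = [counts[item] for item in items]
        surviving.append(len(items))
    return surviving


if __name__ == "__main__":
    print(simulate_tail_loss())
```

In this toy setup an item that fails to appear in one generation's sample can never return, so the count of surviving items can only stay flat or shrink. Real training pipelines are far more complex, but the sketch shows why rare information is the first thing to go.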

The open web's composition has shifted enough to make this a near-term operational concern rather than a theoretical future risk. Common Crawl, the nonprofit web archive that underpins training datasets for models across the industry, ingests whatever the public internet contains. As AI-generated content has proliferated across blogs, news aggregators, product descriptions, forum responses, and social platforms, Common Crawl has begun reflecting that proliferation. Estimates of AI-generated content on the indexed web vary, but the directional trend is unambiguous. The data is getting noisier faster than curation methods are getting better at filtering it.

Founders building AI products today are largely working on top of base models that were trained on data assembled before the current wave of synthetic content saturated the web. The next generation of models will not have that advantage. Labs training on fresh crawls in 2025 and 2026 are contending with a meaningfully worse data environment than labs training on 2022 and 2023 crawls, and the quality gap may not be immediately visible in headline benchmark scores, which are themselves susceptible to contamination and optimization pressure.

The more immediate risk for startups is in fine-tuning and retrieval-augmented generation pipelines that pull from live web sources. A product that retrieves current information to ground its responses is only as good as the current information available to retrieve. If that information layer is increasingly composed of AI-generated summaries, recycled listicles, and keyword-optimized filler, the retrieved context degrades the output regardless of how capable the base model is. This is a data pipeline problem that model improvements cannot fix, because the problem lives upstream of the model in the data selection and curation layer.
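One way to act on this in a retrieval pipeline is to screen retrieved passages before they reach the model. The sketch below is illustrative only: the `Passage` type, the `repetition_ratio` heuristic, and the thresholds are all assumptions chosen for the example, and a repeated-phrase score is a crude proxy for keyword-stuffed filler rather than a reliable detector of AI-generated text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Passage:
    """Hypothetical container for one retrieved chunk of web text."""
    url: str
    text: str


def repetition_ratio(text, n=3):
    """Share of word n-grams that repeat an earlier n-gram in the same text."""
    words = text.lower().split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(grams)


def filter_passages(passages, max_repetition=0.2, min_words=20):
    """Drop passages that are very short or heavily repetitive.

    Both thresholds are arbitrary starting points, not tuned values.
    """
    kept = []
    for passage in passages:
        if len(passage.text.split()) < min_words:
            continue
        if repetition_ratio(passage.text) > max_repetition:
            continue
        kept.append(passage)
    return kept
```

A filter like this sits upstream of the model, which is the point: it changes what context the model sees rather than asking the model to compensate for poor context.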

Research published over the past year from teams at MIT, Oxford, and several AI labs has demonstrated empirically that even small proportions of model-generated text in training corpora can produce measurable degradation in output quality, particularly on tasks requiring precise factual recall or stylistic diversity. The degradation compounds across training generations. A five percent contamination rate in one generation becomes a higher effective rate in the next if the generated outputs from the contaminated model are themselves crawled and included in future datasets.
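The compounding effect can be made concrete with simple arithmetic. The recurrence below is an assumed toy model, not one taken from the research: it supposes that in every generation, newly crawled model outputs displace a fixed fraction of whatever clean data remains. The function name and the `displacement` parameter are hypothetical.

```python
def contamination_over_generations(initial_rate, displacement, generations):
    """Track the synthetic share of a corpus across training generations.

    Assumed recurrence: c[t+1] = c[t] + displacement * (1 - c[t]),
    meaning crawled model outputs replace a fixed fraction of the
    remaining clean share each generation.
    """
    rates = [initial_rate]
    for _ in range(generations):
        previous = rates[-1]
        rates.append(previous + displacement * (1.0 - previous))
    return rates


if __name__ == "__main__":
    for generation, rate in enumerate(contamination_over_generations(0.05, 0.05, 5)):
        print(f"generation {generation}: {rate:.1%} synthetic")
```

Under these assumptions, a five percent starting rate with five percent displacement per generation reaches about 9.75 percent after one generation and about 14.3 percent after two. The specific numbers are illustrative; the shape of the curve, rising every cycle unless filtering removes synthetic material faster than it arrives, is the point.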

Where the moat is shifting

The practical response for startups is to treat data provenance as a first-class engineering concern rather than an afterthought. This means knowing where fine-tuning data comes from, how it was collected, what filtering was applied, and whether the source is likely to have already been contaminated by synthetic material. It means building evaluation suites that test specifically for the failure modes associated with low-quality training data: overconfident generation on obscure topics, reduced stylistic range, and degraded performance on queries that require genuine specificity rather than statistical plausibility.
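As a rough illustration of what tracking provenance could look like in practice, the sketch below attaches a small record to each fine-tuning document and refuses anything with undocumented origins. The `ProvenanceRecord` fields and the helper functions are hypothetical names invented for this example, not part of any existing library or standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-document provenance metadata."""
    doc_id: str
    source: str                  # where the data came from
    collected_on: date           # when it was collected
    collection_method: str       # how it was collected
    filters_applied: tuple = ()  # what filtering was applied
    likely_synthetic: bool = False


def provenance_gaps(record):
    """List the reasons a record should not be trusted as-is."""
    gaps = []
    if not record.source:
        gaps.append("unknown source")
    if not record.collection_method:
        gaps.append("unknown collection method")
    if not record.filters_applied:
        gaps.append("no filtering documented")
    if record.likely_synthetic:
        gaps.append("flagged as likely synthetic")
    return gaps


def split_by_provenance(records):
    """Separate records with complete provenance from those with gaps."""
    accepted, rejected = [], {}
    for record in records:
        gaps = provenance_gaps(record)
        if gaps:
            rejected[record.doc_id] = gaps
        else:
            accepted.append(record)
    return accepted, rejected
```

Keeping the rejection reasons alongside each document ID makes the gaps auditable, so a team can decide whether to fix the metadata or drop the data.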

It also means taking proprietary dataset development seriously as a strategic investment. A company that has assembled a curated, provenance-verified corpus of domain-specific content (clinical notes, legal filings, engineering documentation, financial disclosures, or any other category where quality and accuracy are non-negotiable) holds something that cannot be replicated by crawling the public web at any scale. As the open web gets noisier, the relative value of clean proprietary data increases. The competitive moat in AI is quietly shifting from who has the biggest model to who has the cleanest data for the specific tasks that matter in their vertical.

The labs with the resources to address this at scale are already moving toward licensed data partnerships, synthetic data generation with careful quality controls, and human-in-the-loop curation pipelines that would have seemed prohibitively expensive two years ago but now look like necessary infrastructure. Startups without those resources need to be selective about where they invest in data quality rather than trying to match the scale play. Going deep on a narrow, well-curated domain is more defensible than going wide on a broad but contaminated one. The founders who understand that the data quality problem is now a competitive variable, not just a technical nuisance, will be better positioned for the next phase of model development than those still operating on the assumption that the web is an infinite, reliable training resource. It was, for a while. It is less so every month.

Janet Harrison has over 16 years of experience in the financial services industry, giving her a deep understanding of how news affects the financial markets, and she is an early adopter of blockchain technology and digital currencies. Janet is an active holder and trader who spends most of her time analyzing blockchain projects and reports and watching new and upcoming projects and other initiatives in the industry. She has a master's degree in economics, and her previous roles include investment banking.