The Data Famine

There are no empty shelves. That is why most people miss the famine.

There are no empty shelves. No silent factories. No visible shortage of words, images, videos, posts, documents, tutorials, comments, reviews, code snippets, essays, summaries, or answers. The world appears to be drowning in content. Every search result blooms with pages. Every platform refreshes endlessly. Every blank text box can be filled in seconds. Every company has a blog. Every product has a description. Every question has a generated answer waiting somewhere in the fog.

From a distance, abundance is the obvious condition of the digital world.

But abundance of content is not abundance of origin.

That distinction is the beginning of the data famine.

The machines are not starving because there are too few tokens in the world. They are starving because the tokens that matter — the ones rooted in authentic human encounter with reality — are becoming harder to find, harder to verify, harder to license, harder to separate from imitation, and harder to generate at the scale demanded by the current AI infrastructure boom.

A person can consume calories and still be malnourished. A body can be full and deficient at the same time. The stomach registers abundance while the blood tells another story.

A training corpus can be enormous and still be poor. It can contain trillions of tokens and still lack the specific, rare, expert, fresh, grounded, human-origin signal required to improve real capability. It can grow in volume while declining in value. It can look like a feast to the accounting department and like starvation to anyone who understands what intelligence actually feeds on.
