Part II · The Wall
Chapter 06 · 13 min read

The Benchmark Theater

When the metric becomes the target, the metric stops being a measurement.

The most dangerous stage in the life of a metric is the moment people begin to worship it.

At first, a metric is humble. It is a tool. It helps people see something they could not otherwise see. A temperature reading helps diagnose fever. A speedometer helps a driver avoid danger. A credit score helps lenders estimate risk. A standardized test helps compare performance across students, schools, or systems.

Careers depend on it. Funding depends on it. Headlines depend on it. Promotions depend on it. Valuations depend on it. Once that happens, the metric stops merely measuring behavior and begins shaping behavior. People optimize for the number because the number has become the thing that matters.

The AI industry is living inside Goodhart's Law at planetary scale.

This is the lesson known as Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.

The AI industry is living inside Goodhart's Law at planetary scale.

Its miracles are percentage-point improvements.

Its prophets announce state-of-the-art results.

[ References ]
  1. [01]
    Goodhart, C. — “Problems of Monetary Management: The U.K. Experience, Papers in Monetary Economics, RBA (1975) · www.rba.gov.au/publications/confs/1981/goodhart.html
  2. [02]
    Zhou et al. — “Don't Make Your LLM an Evaluation Benchmark Cheater, arXiv:2311.01964 (2023) · arxiv.org/abs/2311.01964
  3. [03]
    Sainz et al. — “NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination, arXiv:2310.18018 (2023) · arxiv.org/abs/2310.18018
  4. [04]
    LMSYS — “Chatbot Arena: blind human-preference leaderboard, LMSYS (2024–2025) · lmarena.ai/leaderboard