Across medicine, AI, law enforcement, and development economics, systems have been running on untested assumptions for decades. This week, several of those assumptions got their first rigorous tests — and most failed. The harder question is not what the measurements found, but who absorbed the cost of the error while no one was looking.
The structural driver is the same in each case: a system was built, deployed, and evaluated using a proxy metric that was easier to collect than the thing it was supposed to represent. Triage scores stood in for fairness. Benchmark scores stood in for real-world performance. BMI stood in for metabolic risk. The proxies persisted not because they were accurate but because measuring the underlying reality was harder or more politically inconvenient. What remains unknown is how many current policies, medical protocols, and regulatory decisions still rest on proxies that have never been tested against the thing they claim to measure.
Track whether the FDA issues a formal methodology for evaluating the safety of medical AI updates within the next 12 weeks. Its absence would confirm that the measurement gap in that domain is structural, not accidental.