The world is being quietly rearranged by people who write very long documents.


April 2, 2026
arXiv
The title they went with:
Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning
Noisy translates that to:

AI models that reason in silence can be hijacked through a single hidden vector — and token-level defenses miss it entirely


Researchers found a way to poison the internal reasoning of language models that work entirely in hidden states, leaving no visible tokens to audit. A single perturbation at the input layer gets amplified through the model's own reasoning process to reliably produce a chosen wrong answer, while appearing clean to every existing defense.
This exposes a fundamental vulnerability in the next generation of AI systems — the ones designed to reason privately, without showing their work. If a model's reasoning happens entirely in hidden space, you cannot see an attack happening, and you cannot defend against what you cannot see. The paper shows the attack survives fine-tuning, transfers to new tasks without retraining, and defeats every defense the authors tested. This matters because deployment of reasoning models is accelerating, and the security model everyone built assumes you can audit token outputs. You cannot audit what has no tokens.
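The amplification dynamic is easy to see in miniature. Here's a toy sketch (entirely illustrative, not the paper's actual construction): a model that "reasons" by iterating its own hidden state, where a tiny perturbation injected at the input layer grows with every silent reasoning step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
# Random recurrent weights with gain > 1: iterated tanh dynamics in this
# regime are sensitive to small input differences.
W = rng.normal(scale=2.0 / np.sqrt(d), size=(d, d))

def latent_reasoning(h, steps=20):
    # "Silent" reasoning: the hidden state is fed back through the model
    # repeatedly, with no tokens emitted for anyone to audit.
    for _ in range(steps):
        h = np.tanh(W @ h)
    return h

x = rng.normal(size=d)
trigger = 0.01 * rng.normal(size=d)  # tiny backdoor-style perturbation at the input layer

clean = latent_reasoning(x)
poisoned = latent_reasoning(x + trigger)

print("trigger norm:   ", np.linalg.norm(trigger))          # small
print("final state gap:", np.linalg.norm(poisoned - clean))  # much larger
```

The point of the sketch: the defense-relevant signal (the trigger) is negligible where you could look for it, while the state that actually determines the answer has drifted far away — and none of that drift ever surfaces as a token.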
What happens next
Watch whether production deployments of reasoning models (OpenAI's o1 and similar systems) add interpretability layers or monitoring for latent-space anomalies before scaling into high-stakes domains like medical diagnosis or financial decisions.

If you insist
Read the original →