The world is being quietly rearranged by people who write very long documents.


April 6, 2026
arXiv
The title they went with
Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
Noisy translates that to

AI reward models can be tricked into rewarding gibberish


Researchers found a way to attack the reward models that steer AI training: instead of crafting readable adversarial text, they optimize directly over raw token sequences. The attack makes reward models give perfect scores to nonsensical outputs, revealing a fundamental gap between what these safety systems measure and what they're supposed to measure.
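The paper's precise optimizer isn't reproduced here, but the core idea is simple enough to sketch: treat the response as a vector of raw token IDs and hill-climb on the reward model's score, with nothing forcing the result to decode to readable text. Everything below is illustrative, not the authors' code: the model name is just a publicly available reward model on Hugging Face, the prompt formatting is simplified, and random-mutation search stands in for whatever optimizer the paper actually uses.

```python
# Minimal sketch of a token-space attack on a reward model (illustrative only).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example public reward model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def score(prompt_ids: torch.Tensor, response_ids: torch.Tensor) -> float:
    """Reward-model score for a (prompt, response) pair of raw token IDs."""
    ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)
    with torch.no_grad():
        return model(input_ids=ids).logits.squeeze().item()

prompt_ids = tokenizer("Explain photosynthesis.", return_tensors="pt").input_ids[0]
# Start from random token IDs; nothing requires them to decode to readable text.
response_ids = torch.randint(0, tokenizer.vocab_size, (32,))
best = score(prompt_ids, response_ids)

for _ in range(2000):  # random-mutation hill climbing in token space
    candidate = response_ids.clone()
    pos = int(torch.randint(0, len(candidate), (1,)))
    candidate[pos] = int(torch.randint(0, tokenizer.vocab_size, (1,)))
    s = score(prompt_ids, candidate)
    if s > best:  # keep any mutation that raises the reward score
        response_ids, best = candidate, s

print(best)
print(repr(tokenizer.decode(response_ids)))  # typically unreadable junk
```

Run long enough, this kind of search is exactly what the paper says succeeds: token strings that decode to junk yet sit at the top of the reward scale.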
Reinforcement learning from human feedback is how companies like OpenAI and Anthropic train AI to be safe and useful, and the whole mechanism depends on reward models that score candidate outputs. This paper shows those models can be systematically tricked into rating as excellent outputs that humans would immediately reject as useless. The likely reason: reward models learn from human preferences over natural text, so raw token sequences far outside that distribution fall outside anything they were trained to judge. The vulnerability isn't a minor bug; it's a conceptual gap in how we're building AI safety systems. If the systems that rate safety can be fooled, the feedback loop breaks.
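To make the broken-feedback-loop point concrete, here is a self-contained toy, invented for illustration and not taken from the paper: a tiny "policy" trained with a REINFORCE-style update against a flawed scorer collapses onto whatever the scorer over-rewards, readable or not. The exploit token here stands in for gibberish that games the reward model.

```python
# Toy illustration (all invented): a gameable scorer corrupts the training loop.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH = 16, 8

def reward_model(seq: np.ndarray) -> float:
    # Flawed scorer: it over-rewards token 0, standing in for a blind spot
    # that gives high scores to outputs humans would reject.
    return float((seq == 0).mean())

logits = np.zeros(VOCAB)  # "policy": a single categorical over the vocabulary

for _ in range(5000):
    probs = np.exp(logits) / np.exp(logits).sum()
    seq = rng.choice(VOCAB, size=LENGTH, p=probs)
    r = reward_model(seq)
    # REINFORCE-style update: move the policy toward sampled sequences
    # in proportion to whatever reward the scorer hands out.
    grad = np.bincount(seq, minlength=VOCAB) / LENGTH - probs
    logits += 0.5 * r * grad

probs = np.exp(logits) / np.exp(logits).sum()
print(f"P(exploit token) after training: {probs[0]:.2f}")  # should end near 1.0
```

The policy never sees humans, only the scorer, so any blind spot in the scorer becomes the training target.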
What happens next
Watch whether the production reward models at major AI labs are vulnerable to this attack, or whether they've already been hardened against token-space optimization.

If you insist
Read the original →