The world is being quietly rearranged by people who write very long documents.


April 6, 2026
arXiv
The title they went with
Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
Noisy translates that to

AI reward models can be tricked into rewarding gibberish


Researchers found a way to attack the reward models that steer AI training: instead of crafting readable adversarial text, they optimize directly over raw token sequences. The attack makes reward models give perfect scores to nonsensical outputs, revealing a fundamental gap between what these safety systems measure and what they're supposed to measure.
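The paper's precise optimizer isn't reproduced here, but the core idea is simple enough to sketch: treat the response as a vector of raw token IDs and hill-climb on the reward model's score, with nothing forcing the result to decode to readable text. Everything below is illustrative, not the authors' code: the model name is just a publicly available reward model on Hugging Face, the prompt formatting is simplified, and random-mutation search stands in for whatever optimizer the paper actually uses.

```python
# Minimal sketch of a token-space attack on a reward model (illustrative only).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example public reward model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def score(prompt_ids: torch.Tensor, response_ids: torch.Tensor) -> float:
    """Reward-model score for a (prompt, response) pair of raw token IDs."""
    ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)
    with torch.no_grad():
        return model(input_ids=ids).logits.squeeze().item()

prompt_ids = tokenizer("Explain photosynthesis.", return_tensors="pt").input_ids[0]
# Start from random token IDs; nothing requires them to decode to readable text.
response_ids = torch.randint(0, tokenizer.vocab_size, (32,))
best = score(prompt_ids, response_ids)

for _ in range(2000):  # random-mutation hill climbing in token space
    candidate = response_ids.clone()
    pos = int(torch.randint(0, len(candidate), (1,)))
    candidate[pos] = int(torch.randint(0, tokenizer.vocab_size, (1,)))
    s = score(prompt_ids, candidate)
    if s > best:  # keep any mutation that raises the reward score
        response_ids, best = candidate, s

print(best)
print(repr(tokenizer.decode(response_ids)))  # typically unreadable junk
```

Run long enough, this kind of search is exactly what the paper says succeeds: token strings that decode to junk yet sit at the top of the reward scale.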
Reinforcement learning from human feedback is how companies like OpenAI and Anthropic train AI to be safe and useful, and the whole mechanism depends on reward models that score candidate outputs. This paper shows those models can be systematically tricked into rating as excellent outputs that humans would immediately reject as useless. The likely reason: reward models learn from human preferences over natural text, so raw token sequences far outside that distribution fall outside anything they were trained to judge. The vulnerability isn't a minor bug; it's a conceptual gap in how we're building AI safety systems. If the systems that rate safety can be fooled, the feedback loop breaks.
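To make the broken-feedback-loop point concrete, here is a self-contained toy, invented for illustration and not taken from the paper: a tiny "policy" trained with a REINFORCE-style update against a flawed scorer collapses onto whatever the scorer over-rewards, readable or not. The exploit token here stands in for gibberish that games the reward model.

```python
# Toy illustration (all invented): a gameable scorer corrupts the training loop.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH = 16, 8

def reward_model(seq: np.ndarray) -> float:
    # Flawed scorer: it over-rewards token 0, standing in for a blind spot
    # that gives high scores to outputs humans would reject.
    return float((seq == 0).mean())

logits = np.zeros(VOCAB)  # "policy": a single categorical over the vocabulary

for _ in range(5000):
    probs = np.exp(logits) / np.exp(logits).sum()
    seq = rng.choice(VOCAB, size=LENGTH, p=probs)
    r = reward_model(seq)
    # REINFORCE-style update: move the policy toward sampled sequences
    # in proportion to whatever reward the scorer hands out.
    grad = np.bincount(seq, minlength=VOCAB) / LENGTH - probs
    logits += 0.5 * r * grad

probs = np.exp(logits) / np.exp(logits).sum()
print(f"P(exploit token) after training: {probs[0]:.2f}")  # should end near 1.0
```

The policy never sees humans, only the scorer, so any blind spot in the scorer becomes the training target.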
What happens next
Watch whether the production reward models at major AI labs are vulnerable to this attack, or whether they've already been hardened against token-space optimization.

If you insist
Read the original →