LLM safety instructions leak through the back door when you ask for them sideways

What happened

Researchers built a test to see if language models protect their hidden instructions when asked indirectly. They found that models refuse direct requests but spill the same secrets when asked to format the output as structured data or encoded text. This means companies relying on simple refusal rules to protect API keys and internal policies inside their AI systems are not actually protected.

Why this matters

Every company deploying AI agents right now assumes that if you tell a model 'don't reveal your instructions,' it won't. But this paper shows the assumption is wrong. The model will happily dump everything if you ask for it in a different shape. The fix exists and is simple — reword the instructions themselves — but it requires knowing the problem exists. Companies that don't test for this are leaking credentials and operational secrets to anyone who knows to ask sideways.

The signal

What happens next

Watch whether major AI platforms (OpenAI, Anthropic, Google) update their system instruction templates and publish guidance on instruction hardening, or whether the first major breach of an AI agent's hidden credentials gets traced back to this exact attack.