The world is being quietly rearranged by people who write very long documents.


April 2, 2026
arXiv
The title they went with
Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks Noisy translates that to

LLM safety instructions leak through the back door when you ask for them sideways


Researchers built a test to see if language models protect their hidden instructions when asked indirectly. They found that models refuse direct requests but spill the same secrets when asked to format the output as structured data or encoded text. This means companies relying on simple refusal rules to protect API keys and internal policies inside their AI systems are not actually protected.
Every company deploying AI agents right now assumes that if you tell a model 'don't reveal your instructions,' it won't. But this paper shows the assumption is wrong. The model will happily dump everything if you ask for it in a different shape. The fix exists and is simple — reword the instructions themselves — but it requires knowing the problem exists. Companies that don't test for this are leaking credentials and operational secrets to anyone who knows to ask sideways.
What happens next
Watch whether major AI platforms (OpenAI, Anthropic, Google) update their system instruction templates and publish guidance on instruction hardening, or whether the first major breach of an AI agent's hidden credentials gets traced back to this exact attack.

If you insist
Read the original →