What happens next
Watch whether the five models tested here (the paper names them but doesn't disclose which is which) receive new safety layers before the next major agent release, or whether companies ship the same models on the same framework and accept the risk.