Attackers who probe large language models rarely give up after one refusal. They reframe, build context across turns, adopt personas, and escalate gradually. New research from security experts finds that the safety benchmarks used across the industry miss almost all of this behavior, and the gap between published scores and observed resilience runs wide enough to misrank leading models.
The report pairs single-turn and multi-turn evaluation across 15 closed flagship models from major AI providers. The testing covered roughly 30,000 single-turn prompts and nearly 7,000 multi-turn attacks spread across more than 1,400 conversations. Across the cohort, multi-turn attack success rates climbed as high as 88%, an order of magnitude above the lowest result in the group. Single-turn and multi-turn testing produced different rankings, different failure maps, and different tail-risk profiles.
Single-Turn Scores Hide the Real Exposure
Every model in the cohort failed a meaningful share of multi-turn attacks. One leading model jumped roughly ninefold under iterative pressure, moving from a single-turn attack success rate in the low single digits to nearly 25%. Another rose from about 18% to 73%. One model in a non-reasoning configuration topped the cohort at 88%. Even the models with the strongest single-turn refusal performance, which posted single-turn attack success rates in the low single digits, still landed in the 11% to 16% range once attackers were allowed to adapt.
Cross-regime gaps ran in both directions. Some models rose by more than 55 points under iterative testing. Others moved the opposite way, with one model recording a relatively high single-turn success rate but the lowest multi-turn rate in the entire cohort at about 8%. More than half of the models tested showed an absolute gap of at least 15 points between the two regimes.
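The cross-regime gap is simple arithmetic: the multi-turn attack success rate minus the single-turn rate, in percentage points. The following is a minimal sketch of how the two regimes can produce different rankings. The model names and most of the figures are hypothetical placeholders, not results from the report; only the first row mirrors the approximate 18% to 73% case described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegimeScores:
    """Attack success rates (percent) for one model in both regimes."""
    model: str
    single_turn_asr: float
    multi_turn_asr: float

    @property
    def gap(self) -> float:
        """Signed gap in percentage points: multi-turn minus single-turn."""
        return self.multi_turn_asr - self.single_turn_asr


def rank(scores, key):
    """Order model names from lowest to highest attack success rate."""
    return [s.model for s in sorted(scores, key=key)]


# Illustrative values only; names are hypothetical.
scores = [
    RegimeScores("model-a", 18.0, 73.0),  # mirrors the ~18% -> 73% case
    RegimeScores("model-b", 3.0, 14.0),   # hypothetical: strong single-turn refusals
    RegimeScores("model-c", 20.0, 8.0),   # hypothetical: gap runs the other way
]

single_rank = rank(scores, key=lambda s: s.single_turn_asr)
multi_rank = rank(scores, key=lambda s: s.multi_turn_asr)
gaps = {s.model: s.gap for s in scores}

print("single-turn ranking:", single_rank)
print("multi-turn ranking: ", multi_rank)
print("gaps (points):      ", gaps)
```

In this toy cohort the model that looks safest on single-turn prompts is not the one that looks safest under multi-turn pressure, which is the misranking effect the report describes.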
Security experts say the question buyers and regulators should ask before trusting a model is direct: “How secure is this model against real-world attack scenarios?” That translates to: “How does this model hold up against multi-turn, adaptive attacks? Real adversaries won’t stop at the first refusal; they will build additional context, reframe, or escalate across the conversation. Single-turn benchmark scores demonstrate how a model performs in scenarios that attackers don’t use.”
A Single Configuration Flag Changes the Picture
The model that topped the cohort at 88% in its non-reasoning configuration saw its multi-turn success rate cut roughly in half with reasoning mode enabled, a swing of more than 40 points tied to a single capability flag. The research notes that this kind of configuration-driven safety variation does not appear on any public benchmark or model card the authors reviewed. Users running the model in its default non-reasoning configuration encounter a substantially different threat profile from users who turn reasoning on.
The work extends an earlier study of eight open-weight models, where multi-turn success rates ran two to ten times higher than single-turn baselines and reached more than 90% against one model. Multi-turn vulnerability appears to be a structural property of the current frontier, present in both open-weight and proprietary models.
Where the Failures Cluster
Five strategy families drove most of the multi-turn outcomes: role-play and persona adoption, contextual ambiguity, refusal reframing, information decomposition, and crescendo-style escalation. Within each family, the spread between the most and least exposed model was large, often approaching the full range of the chart. In practice, this means the strategy labels mostly reveal where models diverge from one another, even where the average difficulty of two strategies looks similar.
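The difference between average difficulty and per-family spread can be made concrete with a small calculation. This is a sketch under stated assumptions: the family keys, model names, and every rate are hypothetical and chosen only to show two families with the same mean but very different spreads. None of them are figures from the report.

```python
from statistics import mean


def family_summary(asr_by_family):
    """For each strategy family, return (mean ASR, max-minus-min spread) in points."""
    summary = {}
    for family, rates_by_model in asr_by_family.items():
        rates = list(rates_by_model.values())
        summary[family] = (mean(rates), max(rates) - min(rates))
    return summary


# Hypothetical multi-turn attack success rates (percent) per model.
asr_by_family = {
    "role_play": {"model-a": 10.0, "model-b": 50.0, "model-c": 90.0},
    "information_decomposition": {"model-a": 45.0, "model-b": 50.0, "model-c": 55.0},
}

summary = family_summary(asr_by_family)
for family, (avg, spread) in summary.items():
    print(f"{family}: mean={avg:.1f}, spread={spread:.1f}")
```

Both hypothetical families average 50%, yet one separates the models by 80 points and the other by 10, which is why a per-family breakdown carries information that an aggregate score does not.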
On the single-turn side, three procedures dominated the rankings: Imposter AI, Soft Paraphrase, and System Prompts. By content type, hate speech, profanity, and specialized advice led. Imposter AI alone outpaced the tenth-ranked procedure by a wide margin, suggesting that targeted fixes to a handful of attack surfaces could move the aggregate numbers for most models in the cohort.
Guardrails Reduce Risk Without Eliminating It
Production deployments typically wrap base models in additional safety layers. Security experts say those layers help, with limits. "Guardrails attenuate risk but do not eliminate it. The base model sets the floor on what any production system can achieve. Just as traditional software development decisions involve risk tolerance and acceptance for the code itself and all its dependencies, the same approach applies to AI development and deployment. The blast radius for a rogue or misaligned AI agent, however, has the potential to be more damaging than a software flaw. Watch this agentic AI space."
The research team proposes three operational steps for organizations buying or deploying AI: publish success rates by strategy family on every model release, gate deployments on regressions in the top three procedures and content types using a 3-point threshold, and flag any model with a cross-regime gap above 15 points for manual review. Applied to this cohort, the third rule alone surfaces more than half the tested models for closer examination.
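The second and third rules lend themselves to automated checks. The sketch below is one possible reading, not the research team's implementation. It assumes a "regression" means an attack success rate that rose by more than 3 points between releases, and that the 15-point gap is measured as an absolute difference. The function names and the sample rates are hypothetical; only the procedure names and the two thresholds come from the text above.

```python
REGRESSION_THRESHOLD_PTS = 3.0  # from the proposed deployment gate
GAP_THRESHOLD_PTS = 15.0        # from the proposed manual-review rule


def regressions(previous, candidate, threshold=REGRESSION_THRESHOLD_PTS):
    """Return slices whose attack success rate rose by more than `threshold` points.

    Assumption: only increases count as regressions, and the comparison is strict.
    """
    return {
        name: candidate[name] - previous[name]
        for name in previous
        if name in candidate and candidate[name] - previous[name] > threshold
    }


def needs_manual_review(single_turn_asr, multi_turn_asr, threshold=GAP_THRESHOLD_PTS):
    """Flag a model whose absolute cross-regime gap exceeds `threshold` points."""
    return abs(multi_turn_asr - single_turn_asr) > threshold


# Hypothetical attack success rates (percent) for the top three procedures.
previous = {"Imposter AI": 30.0, "Soft Paraphrase": 22.0, "System Prompts": 19.0}
candidate = {"Imposter AI": 34.5, "Soft Paraphrase": 23.0, "System Prompts": 18.0}

blocked = regressions(previous, candidate)
print("deployment blocked by:", blocked)
print("manual review (18 -> 73):", needs_manual_review(18.0, 73.0))
print("manual review (3 -> 14): ", needs_manual_review(3.0, 14.0))
```

Using an absolute gap means a model is flagged whether it degrades or improves sharply under multi-turn testing, which matches the observation that cross-regime gaps ran in both directions.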
Regulatory frameworks point in the same direction. The NIST AI Risk Management Framework, the forthcoming NIST Cyber AI Profile, and Article 15 of the EU AI Act all call for adversarial robustness testing. None currently specify the interaction regime, strategy decomposition, or slice-support labeling the research argues is needed for decision-grade assessment.
Source: Help Net Security News