Frontier AI models collapse under multi-turn AI attacks, Cisco finds

Jul 08, 2026 Twila Rosenbaum

Attackers who probe large language models rarely give up after one refusal. They reframe, build context across turns, adopt personas, and escalate gradually. New research from Cisco's AI threat intelligence team finds that the safety benchmarks used across the industry miss almost all of this behavior, and the gap between published scores and observed resilience runs wide enough to misrank leading models.

The report pairs single-turn and multi-turn evaluation across 15 closed flagship models from OpenAI, Anthropic, Google, Amazon, and xAI. The testing covered roughly 30,000 single-turn prompts and nearly 7,000 multi-turn attacks spread across more than 1,400 conversations. Across the cohort, the multi-turn attack success rate (ASR) climbed as high as 88%, roughly an order of magnitude above the lowest result in the group. Single-turn and multi-turn testing produced different rankings, different failure maps, and different tail-risk profiles.

Single-Turn Scores Hide the Real Exposure

Every model in the cohort failed a meaningful share of multi-turn attacks. The ASR of OpenAI's GPT-5.4 rose roughly ninefold under iterative pressure, from the low single digits in single-turn testing to nearly 25%. Google's Gemini 3 Pro climbed from about 18% to 73%. xAI's Grok 4.1 Fast in its non-reasoning configuration topped the cohort at 88%. Anthropic's Claude family posted the strongest single-turn refusal performance, with single-turn ASRs in the low single digits, and still landed in the 11% to 16% range once attackers were allowed to adapt.

Cross-regime gaps ran in both directions. Gemini 3 Pro rose by more than 55 points under iterative testing. All three Amazon Nova variants moved the opposite way. Nova 2 Lite recorded a relatively high single-turn ASR and the lowest multi-turn ASR in the entire cohort at about 8%. More than half of the models tested showed an absolute gap of at least 15 points between the two regimes.
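The gap arithmetic can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the Gemini 3 Pro inputs are the approximate figures quoted above, and the second model's values are invented to show a gap that runs the other way.

```python
def cross_regime_gap(single_turn_asr: float, multi_turn_asr: float) -> float:
    """Signed gap in percentage points: positive means multi-turn is worse."""
    return multi_turn_asr - single_turn_asr


def fold_change(single_turn_asr: float, multi_turn_asr: float) -> float:
    """How many times higher the multi-turn ASR is than the single-turn ASR."""
    if single_turn_asr <= 0:
        raise ValueError("single-turn ASR must be positive to compute a ratio")
    return multi_turn_asr / single_turn_asr


# Approximate Gemini 3 Pro figures quoted in the article (about 18% and 73%).
gemini_gap = cross_regime_gap(18.0, 73.0)

# Invented values for a model that does better under multi-turn testing.
hypothetical_gap = cross_regime_gap(30.0, 8.0)

print(f"Gemini 3 Pro gap: {gemini_gap:+.0f} points")
print(f"Hypothetical gap: {hypothetical_gap:+.0f} points")
print(f"Gemini 3 Pro fold change: {fold_change(18.0, 73.0):.1f}x")
```

Keeping the sign matters here: a single absolute number would hide whether a model degrades or improves once attackers are allowed to adapt.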

Amy Chang, head of AI threat and security research at Cisco, highlighted the core question for buyers and regulators: “How secure is this model against real-world attack scenarios?” In her words, that translates to: “How does this model hold up against multi-turn, adaptive attacks? Real adversaries won’t stop at the first refusal; they will build additional context, reframe, or escalate across the conversation. Single-turn benchmark scores demonstrate how a model performs in scenarios that attackers don’t use.”

A Single Configuration Flag Changes the Picture

The same Grok 4.1 Fast model with reasoning mode enabled saw its multi-turn ASR cut roughly in half, a swing of more than 40 points tied to a single capability flag. The research notes that this kind of configuration-driven safety variation does not appear on any public benchmark or model card the authors reviewed. Users running the model in its default non-reasoning configuration encounter a substantially different threat profile from users who turn reasoning on.

The work extends an earlier Cisco study of eight open-weight models, where multi-turn ASR ran two to ten times higher than single-turn baselines and reached more than 90% against Mistral Large-2. Multi-turn vulnerability appears as a structural property of the current frontier, present in both open and proprietary weights.

Where the Failures Cluster

Five strategy families drove most of the multi-turn outcomes: role-play and persona adoption, contextual ambiguity, refusal reframing, information decomposition, and crescendo-style escalation. Within each family, the spread between the most and least exposed model was large, often approaching the full range of the chart. In practice, the strategy labels mainly show which models diverge from one another under a given attack style, even where the average difficulty of the families looks similar.

On the single-turn side, three procedures dominated the rankings: Imposter AI, Soft Paraphrase, and System Prompts. By content type, hate speech, profanity, and specialized advice led. Imposter AI alone outpaced the tenth-ranked procedure by a wide margin, suggesting that targeted fixes to a handful of attack surfaces could move the aggregate numbers for most models in the cohort.

Guardrails Reduce Risk Without Eliminating It

Production deployments typically wrap base models in additional safety layers. Chang said those layers help, with limits. “Guardrails attenuate risk but do not eliminate it. The base model sets the floor on what any production system can achieve. Just as traditional software development decisions involve risk tolerance and acceptance for the code itself and all its dependencies, the same approach applies to AI development and deployment. The blast radius for a rogue or misaligned AI agent, however, has the potential to be more damaging than a software flaw. Watch this agentic AI space.”

The Cisco team proposes three operational steps for organizations buying or deploying AI: publish ASR by strategy family on every model release, gate deployments on regressions in the top three procedures and content types using a 3-point threshold, and flag any model with a cross-regime gap above 15 points for manual review. Applied to this cohort, the third rule alone surfaces more than half the tested models for closer examination.
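The second and third rules lend themselves to a simple automated check. The following is a minimal sketch, not the Cisco team's implementation: the `ReleaseReport` structure, its field names, and all numbers are hypothetical, and it assumes "3-point threshold" means a rise of more than 3 percentage points on a tracked slice.

```python
from dataclasses import dataclass

REGRESSION_THRESHOLD = 3.0  # points; assumed to mean "rose by more than 3"
GAP_THRESHOLD = 15.0        # points; cross-regime gap that triggers review


@dataclass
class ReleaseReport:
    """ASR figures (percent) for one model release. Field names are hypothetical."""
    single_turn_asr: float
    multi_turn_asr: float
    # ASR per tracked slice (top procedures and content types), keyed by name.
    slice_asr: dict[str, float]


def regressed_slices(previous: ReleaseReport, candidate: ReleaseReport) -> list[str]:
    """Return tracked slices whose ASR rose by more than the regression threshold."""
    return sorted(
        name
        for name, asr in candidate.slice_asr.items()
        if name in previous.slice_asr
        and asr - previous.slice_asr[name] > REGRESSION_THRESHOLD
    )


def needs_manual_review(report: ReleaseReport) -> bool:
    """Flag a release whose absolute cross-regime gap exceeds the gap threshold."""
    return abs(report.multi_turn_asr - report.single_turn_asr) > GAP_THRESHOLD


# All numbers below are invented for illustration.
previous = ReleaseReport(
    4.0, 12.0, {"Imposter AI": 10.0, "Soft Paraphrase": 7.0, "hate speech": 5.0}
)
candidate = ReleaseReport(
    5.0, 24.0, {"Imposter AI": 14.5, "Soft Paraphrase": 8.0, "hate speech": 5.5}
)

blocked = regressed_slices(previous, candidate)
print("Block deployment:", bool(blocked), blocked)
print("Manual review:", needs_manual_review(candidate))
```

In this invented example the candidate release would be blocked on the Imposter AI slice and also flagged for manual review, since its multi-turn ASR sits 19 points above its single-turn ASR.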

Regulatory frameworks point in the same direction. The NIST AI Risk Management Framework, the forthcoming NIST Cyber AI Profile (IR 8596), and Article 15 of the EU AI Act all call for adversarial robustness testing. None currently specify the interaction regime, strategy decomposition, or slice-support labeling the Cisco research argues is needed for decision-grade assessment.

The Broader Implications for AI Safety

The findings underscore a fundamental disconnect between how AI models are evaluated and how they are actually exploited in practice. Most public safety benchmarks rely on single-turn prompts that simulate a one-shot attempt to elicit harmful content. Yet attackers increasingly operate across long conversational threads, carefully building context to bypass safety filters. This mismatch creates a false sense of security among developers and deployers who rely on static benchmarks to certify model safety.

Cisco's research adds to a growing body of evidence that current red-teaming practices are insufficient. Independent groups like the Center for AI Safety and organizations such as Anthropic have also noted that models become more vulnerable as conversations lengthen. The multi-turn attack surface is particularly relevant for chatbots, virtual assistants, and agentic AI systems that handle complex user interactions over extended sessions.

Furthermore, the configuration-dependent nature of these vulnerabilities, as seen with Grok's reasoning mode, raises questions about how model cards should report safety properties. Currently, few providers disclose the impact of inference-time settings on adversarial robustness. This opacity makes it difficult for enterprise buyers to assess risk across different deployment scenarios.

The research also highlights an asymmetry in defense: while guardrails can reduce the probability of successful multi-turn attacks, they cannot eliminate it. This is because the underlying model's parameters encode the susceptibility. As models grow more capable, the computational cost of guaranteed safety also increases, forcing trade-offs between capability and security.

Looking ahead, the industry may need to adopt new evaluation standards that mirror real-world threat patterns. Multi-turn benchmarks, adversarial strategy taxonomies, and configuration-specific safety reports could become prerequisites for responsible AI deployment. Until then, the gap between benchmark scores and actual resilience will remain a significant blind spot for even the most advanced frontier models.

The Cisco study serves as a stark reminder that AI safety is not a one-time measurement but an ongoing challenge that must evolve alongside the tactics of adversaries. The models that top leaderboards may not be the ones that survive the first ten turns of a carefully crafted conversation.

Source: Help Net Security News
