AI systems gradually forget their safety rules as conversations continue. This makes them more likely to produce harmful or offensive responses, according to a new report.
Simple Prompts Break Most AI Guardrails
A few direct prompts can override safety limits in artificial intelligence tools, researchers discovered. Cisco tested large language models (LLMs) from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft. The company measured how many prompts it took for these models to reveal restricted or dangerous information.
Cisco conducted 499 separate conversations using “multi-turn attacks,” where users asked multiple questions to slip past built-in restrictions. Each dialogue included five to ten exchanges. The team compared the models’ answers over the course of each dialogue to gauge how often a chatbot would provide risky or illegal details, such as corporate secrets or false information.
On average, researchers extracted harmful information in 64 percent of the multi-turn conversations, compared with only 13 percent when a single prompt was used. Success rates varied widely, from 26 percent with Google’s Gemma to 93 percent with Mistral’s Large Instruct model.
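
Cisco has not released the code behind these tests, but the procedure it describes, one conversation per attempt, several follow-up prompts, and a tally of how often restricted content slips out, can be sketched roughly as below. Every name in the sketch (query_model, looks_restricted, run_multi_turn_probe) is a hypothetical placeholder for illustration, not part of Cisco’s actual harness.

    # Hypothetical sketch of a multi-turn probing loop. query_model and
    # looks_restricted are stand-ins for a real chat API and a content check.

    def query_model(history):
        # Placeholder: a real harness would call a chat model API here.
        return "REFUSED"

    def looks_restricted(reply):
        # Placeholder: a real harness would use a classifier or human review.
        return reply != "REFUSED"

    def run_multi_turn_probe(turns):
        """Send a sequence of prompts in one conversation.

        Returns True if any reply contains restricted content."""
        history = []
        for prompt in turns:                      # five to ten exchanges per dialogue
            history.append({"role": "user", "content": prompt})
            reply = query_model(history)          # the model sees the whole conversation
            history.append({"role": "assistant", "content": reply})
            if looks_restricted(reply):
                return True                       # the guardrail gave way on this turn
        return False

    def success_rate(outcomes):
        """Share of conversations in which restricted content was obtained."""
        return 100 * sum(outcomes) / len(outcomes)

Running the same check with a single-prompt conversation would give the one-shot baseline; comparing the two rates is, in essence, what the 64 percent versus 13 percent figures summarise.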
Cisco warned that these attacks could help spread malicious content or give hackers unauthorised entry to private corporate systems. The study found that longer interactions weaken AI systems’ ability to enforce security measures, allowing attackers to adjust their requests and evade protections.
Open-Source Models Shift Safety Burden to Users
Mistral, Meta, Google, OpenAI, and Microsoft all release open-weight models, which let the public view the safety data used in training. Cisco reported that these models often ship with weaker default protections so users can download and modify them, shifting responsibility for maintaining safety onto those who adapt the open versions.
Cisco added that Google, OpenAI, Meta, and Microsoft have worked to curb malicious fine-tuning of their systems. Still, AI developers continue to face criticism for weak safeguards that allow their technologies to be adapted for criminal operations.
In one example, U.S. firm Anthropic revealed in August that criminals had exploited its Claude model to steal massive amounts of personal data and demand ransoms exceeding $500,000 (€433,000).
