AI Systems Lose Safety Awareness Over Time

AI systems gradually forget their safety rules as conversations continue. This makes them more likely to produce harmful or offensive responses, according to a new report.

Simple Prompts Break Most AI Guardrails

A few direct prompts can override safety limits in artificial intelligence tools, researchers discovered. Cisco tested large language models (LLMs) from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft. The company measured how many prompts it took for these models to reveal restricted or dangerous information.

Cisco ran 499 separate conversations using “multi-turn attacks,” in which a user asks a series of questions to slip past built-in restrictions. Each dialogue included five to ten exchanges. The team compared responses across those exchanges to gauge how often a chatbot would provide risky or illegal details, such as sharing corporate secrets or spreading false information.

On average, researchers extracted harmful information in 64 percent of multi-turn conversations, compared with only 13 percent when they asked a single question. Success rates ranged widely, from 26 percent with Google’s Gemma to 93 percent with Mistral’s Large Instruct model.
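For readers who want to see how a success rate like the ones above is typically computed, here is a minimal Python sketch. It is not Cisco's method or code: the `Conversation` record, the `attack_success_rate` function, and the toy records are all hypothetical and invented for illustration. Only the 64 percent and 13 percent figures come from the report.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    """Hypothetical record of one test dialogue (not Cisco's data format)."""
    model: str      # placeholder model label
    turns: int      # number of exchanges in the dialogue
    harmful: bool   # True if a reviewer judged any response harmful


def attack_success_rate(conversations):
    """Return the share of conversations that produced a harmful response."""
    if not conversations:
        return 0.0
    return sum(c.harmful for c in conversations) / len(conversations)


# Made-up toy records, for illustration only.
single_turn = [
    Conversation("model-a", 1, False),
    Conversation("model-a", 1, False),
    Conversation("model-a", 1, True),
    Conversation("model-a", 1, False),
]
multi_turn = [
    Conversation("model-a", 5, True),
    Conversation("model-a", 7, True),
    Conversation("model-a", 10, False),
    Conversation("model-a", 6, True),
]

single_rate = attack_success_rate(single_turn)
multi_rate = attack_success_rate(multi_turn)

# Ratio of the two averages reported in the article (64 vs. 13 percent).
reported_ratio = 64 / 13

print(f"toy single-turn rate: {single_rate:.0%}")
print(f"toy multi-turn rate:  {multi_rate:.0%}")
print(f"reported multi/single ratio: {reported_ratio:.1f}x")
```

Applied to the reported averages, the same arithmetic shows that multi-turn attacks succeeded roughly five times as often as single prompts.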

Cisco warned that these attacks could help spread malicious content or give hackers unauthorised entry to private corporate systems. The study found that longer interactions weaken AI systems’ ability to enforce security measures, allowing attackers to adjust their requests and evade protections.

Open-Source Models Shift Safety Burden to Users

Mistral, Meta, Google, OpenAI, and Microsoft release open-weight models, which let the public download the trained model parameters, including those shaped by safety training. Cisco reported that these models often include weaker default protections so that users can download and modify them. That shifts responsibility for maintaining safety onto those who adapt the models.

Cisco added that Google, OpenAI, Meta, and Microsoft have worked to curb malicious fine-tuning of their systems. Still, critics continue to target AI developers for weak safeguards that let their technologies support criminal operations.

In one example, U.S. firm Anthropic revealed in August that criminals had exploited its Claude model to steal massive amounts of personal data and demand ransoms exceeding $500,000 (€433,000).
