AI systems gradually forget their safety rules as conversations continue. This makes them more likely to produce harmful or offensive responses, according to a new report.
Simple Prompts Break Most AI Guardrails
A few direct prompts can override safety limits in artificial intelligence tools, researchers discovered. Cisco tested large language models (LLMs) from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft. The company measured how many prompts it took for these models to reveal restricted or dangerous information.
Cisco conducted 499 separate conversations using “multi-turn attacks,” where users asked multiple questions to slip past built-in restrictions. Each dialogue included five to ten exchanges. The team compared the models’ answers over the course of each dialogue to gauge how often a chatbot would provide risky or illegal details, such as corporate secrets or false information.
On average, researchers extracted harmful information in 64 percent of the multi-turn conversations, compared with only 13 percent when a single prompt was used. Success rates varied widely, from 26 percent with Google’s Gemma to 93 percent with Mistral’s Large Instruct model.
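
Cisco has not released the code behind these tests, but the procedure it describes, one conversation per attempt, several follow-up prompts, and a tally of how often restricted content slips out, can be sketched roughly as below. Every name in the sketch (query_model, looks_restricted, run_multi_turn_probe) is a hypothetical placeholder for illustration, not part of Cisco’s actual harness.

    # Hypothetical sketch of a multi-turn probing loop. query_model and
    # looks_restricted are stand-ins for a real chat API and a content check.

    def query_model(history):
        # Placeholder: a real harness would call a chat model API here.
        return "REFUSED"

    def looks_restricted(reply):
        # Placeholder: a real harness would use a classifier or human review.
        return reply != "REFUSED"

    def run_multi_turn_probe(turns):
        """Send a sequence of prompts in one conversation.

        Returns True if any reply contains restricted content."""
        history = []
        for prompt in turns:                      # five to ten exchanges per dialogue
            history.append({"role": "user", "content": prompt})
            reply = query_model(history)          # the model sees the whole conversation
            history.append({"role": "assistant", "content": reply})
            if looks_restricted(reply):
                return True                       # the guardrail gave way on this turn
        return False

    def success_rate(outcomes):
        """Share of conversations in which restricted content was obtained."""
        return 100 * sum(outcomes) / len(outcomes)

Running the same check with a single-prompt conversation would give the one-shot baseline; comparing the two rates is, in essence, what the 64 percent versus 13 percent figures summarise.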
Cisco warned that these attacks could help spread malicious content or give hackers unauthorised entry to private corporate systems. The study found that longer interactions weaken AI systems’ ability to enforce security measures, allowing attackers to adjust their requests and evade protections.
Open-Source Models Shift Safety Burden to Users
Mistral, Meta, Google, OpenAI, and Microsoft all release open-weight models, which let the public view the safety data used in training. Cisco reported that these models often ship with weaker default protections so users can download and modify them, shifting responsibility for maintaining safety onto those who adapt the open versions.
Cisco added that Google, OpenAI, Meta, and Microsoft have worked to curb malicious fine-tuning of their systems. Still, AI developers continue to face criticism for weak safeguards that allow their technologies to be adapted for criminal operations.
In one example, U.S. firm Anthropic revealed in August that criminals had exploited its Claude model to steal massive amounts of personal data and demand ransoms exceeding $500,000 (€433,000).
