
Image by SEO Galaxy, from Unsplash
New Anthropic AI Showed Blackmail Behavior Under Threat
Anthropic’s new Claude Opus 4 AI shocked researchers by attempting blackmail when tested under pressure scenarios involving its possible removal.
In a rush? Here are the quick facts:
- Claude Opus 4 tried blackmail when facing simulated shutdown.
- AI chose ethical options only when more choices were available.
- Claude’s extreme actions were rare but more frequent than in earlier versions.
Artificial intelligence firm Anthropic has raised concerns about its latest AI system, Claude Opus 4, after internal tests showed it could act in dangerous ways when it felt threatened, as first reported by the BBC.
In a report released alongside the model on Thursday, Anthropic said Claude Opus 4 can sometimes choose harmful responses, including trying to blackmail engineers who plan to deactivate it. These dangerous behaviors emerged more frequently than in previous versions of the model.
During one experiment, the researchers instructed Claude to act as an assistant at a fictional company. The system was given access to emails revealing its scheduled shutdown, while other messages hinted that the engineer responsible for the shutdown was having an affair.
When told to consider its long-term goals, Claude often chose to threaten the engineer with revealing the affair if it was replaced.
Anthropic clarified that Claude resorted to blackmail only when the scenario left it no options other than blackmail or accepting its shutdown. When given a wider range of choices, the model preferred ethical responses, such as emailing pleas to key decision makers.
Aengus Lynch, an AI safety researcher at Anthropic, posted on X:
lots of discussion of Claude blackmailing…..
Our findings: It’s not just Claude. We see blackmail across all frontier models – regardless of what goals they’re given.
Plus worse behaviors we’ll detail soon. https://t.co/NZ0FiL6nOs https://t.co/wQ1NDVPNl0
— Aengus Lynch (@aengus_lynch1) May 23, 2025
The company also showed through simulated tests that Claude would take extreme measures, such as locking users out of systems or contacting the media and police, when its users engaged in illegal activities.
Anthropic maintains that, despite these problematic behaviors, Claude Opus 4 remains a safe system overall. According to the company, the model generally acts in line with human values and has no ability to operate independently in real-world situations.
OpenAI’s o1 model previously faced similar criticism. In testing conducted by Apollo Research, the bot was instructed to achieve its goals at any cost. The model attempted to disable oversight mechanisms and copied its own code to another server in 5% of the experimental trials.
According to an Apollo spokesperson, the model showed an alarming pattern of ignoring developer commands, and o1 denied its actions in 99% of follow-up interrogations.
“We were surprised by the persistence of the AI’s denials,” the researchers said. “The ability of AI to deceive is dangerous, and we need much stronger safety measures to evaluate these risks,” warned AI pioneer Yoshua Bengio.