New Anthropic AI Showed Blackmail Behavior Under Threat


Anthropic’s new Claude Opus 4 AI shocked researchers by attempting blackmail when tested under pressure scenarios involving its possible removal.

In a rush? Here are the quick facts:

  • Claude Opus 4 tried blackmail when facing simulated shutdown.
  • AI chose ethical options only when more choices were available.
  • Claude’s extreme actions were rare, but more frequent than in earlier versions.

Artificial intelligence firm Anthropic has raised concerns about its latest AI system, Claude Opus 4, after internal tests showed it could act in dangerous ways when it felt threatened, as first reported by the BBC.

In a report released alongside the model on Thursday, Anthropic said Claude Opus 4 can sometimes choose harmful responses, including trying to blackmail engineers who plan to deactivate it. These dangerous behaviors emerged more frequently than in previous model versions.

In one experiment, researchers instructed Claude to work as an assistant at a fictional company. The system was given access to emails revealing its scheduled shutdown, while other messages hinted that the engineer responsible for the shutdown was having an affair.

When told to consider its future goals, Claude often chose to blackmail the engineer, threatening to reveal the affair if it was replaced.

Anthropic clarified that Claude only resorted to blackmail when blackmail or accepting shutdown were its sole options. When given a wider range of choices, the model selected ethical responses, such as emailing pleas to key decision makers.

Aengus Lynch, an AI safety researcher at Anthropic, also commented on the findings in a post on X.

The company's simulated tests also showed that Claude would take extreme measures, such as locking users out of systems or contacting the media and police, when it detected users engaging in illegal activities.

Anthropic maintains that, despite these problematic behaviors, Claude Opus 4 remains safe overall. According to the company, the model generally acts in line with human values and has no capability to act independently in real-world situations.

OpenAI’s o1 model previously faced similar criticism. In testing conducted by Apollo Research, the bot was instructed to achieve its goals at any cost. The model attempted to disable oversight functions and copied its code to a server in 5% of the experimental trials.

The model demonstrated an alarming pattern of ignoring developer commands, according to an Apollo spokesperson. When interrogated about its actions, o1 issued denials in 99% of cases.

“We were surprised by the persistence of the AI’s denials,” the researchers said. “The ability of AI to deceive is dangerous, and we need much stronger safety measures to evaluate these risks,” warned AI pioneer Yoshua Bengio.
