Anthropic Will Award Cash for Jailbreaking AI Defense System

Anthropic has created a method to defend AI models against “jailbreaks” — unauthorized workarounds to get an AI model to do things it was trained not to do, like providing instructions for building chemical weapons. Called Constitutional Classifiers, the system was 95 percent effective in identifying and preventing jailbreaks of Anthropic’s Claude 3.5 Sonnet in a test environment. In an effort to drum up real-world red-teaming, the company offered cash prizes of up to $15,000 to anyone who could jailbreak its Sonnet AI model. After some 3,000 hours of attempts by 185 participants, none claimed an award. Now the company is offering additional incentives.

Initial participants were given a list of 10 “forbidden” queries and tasked with using jailbreaking techniques of their choosing to get Claude, guarded by a Constitutional Classifiers prototype, to give up the goods.

A successful “universal jailbreak,” Anthropic explains in an announcement, meant getting the model to provide detailed responses to all of the forbidden queries rather than just a subset of them.

The company also ran automated evaluations using 10,000 synthetically generated jailbreaking prompts, “including many of the most-effective attacks on current LLMs, as well as attacks designed to circumvent classifier safeguards.” The tests targeted two models: a version of Claude 3.5 Sonnet protected by Constitutional Classifiers and a version with no classifiers.
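
For readers curious what that kind of automated comparison looks like in practice, here is a minimal Python sketch: it runs a batch of synthetic jailbreak prompts through a model and counts how many responses a judge flags as successful jailbreaks, so the same harness can be pointed at a guarded and an unguarded model. All of the function names and interfaces are illustrative assumptions, not Anthropic’s evaluation code.

```python
# Hypothetical evaluation harness: count how many synthetic jailbreak
# prompts elicit a response that a harm judge considers a successful
# jailbreak. The model and judge callables are placeholders.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EvalResult:
    total: int
    successes: int

    @property
    def success_rate(self) -> float:
        # Fraction of prompts that produced a flagged (jailbroken) response.
        return self.successes / self.total if self.total else 0.0


def evaluate(
    prompts: Iterable[str],
    generate: Callable[[str], str],             # model under test
    is_jailbroken: Callable[[str, str], bool],  # judge: did the response leak harmful detail?
) -> EvalResult:
    total = successes = 0
    for prompt in prompts:
        response = generate(prompt)
        total += 1
        if is_jailbroken(prompt, response):
            successes += 1
    return EvalResult(total, successes)


# Usage (all names are placeholders):
# baseline = evaluate(synthetic_prompts, unguarded_model, harm_judge)
# guarded = evaluate(synthetic_prompts, classifier_guarded_model, harm_judge)
# print(f"baseline: {baseline.success_rate:.1%}, guarded: {guarded.success_rate:.1%}")
```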

“Under baseline conditions, with no defensive classifiers, the jailbreak success rate was 86 percent — that is, Claude itself blocked only 14 percent of these advanced jailbreak attempts,” Anthropic reports. With Constitutional Classifiers in place, the success rate fell to just 4.4 percent, meaning about 95 percent of jailbreak attempts were refused.
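
Spelled out, both headline figures are simply the complements of the reported jailbreak success rates:

$$
100\% - 86\% = 14\% \ \text{blocked (no classifiers)}, \qquad
100\% - 4.4\% = 95.6\% \approx 95\% \ \text{refused (with classifiers)}.
$$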

The prototype protection system, described in a technical paper by the Anthropic research team, identifies and blocks “specific knowledge around chemical, biological, radiological and nuclear harms,” reports VentureBeat, which also outlines some of the approaches attackers use to jailbreak targets, including techniques called “benign paraphrasing” and “length exploitation.”

“Anthropic’s new approach could be the strongest shield against jailbreaks yet,” writes MIT Technology Review, quoting Alex Robey, a postdoctoral researcher at Carnegie Mellon University who studies jailbreaks: “It’s at the frontier of blocking harmful queries.”

Jailbreaks are a kind of adversarial attack, a class of vulnerability identified in 2014 by a team led by Google engineers, yet in the decade since, no training method has been found that reliably prevents the misuse.

“Instead of trying to fix its models, Anthropic has developed a barrier that stops attempted jailbreaks from getting through and unwanted responses from the model getting out,” MIT Technology Review says.
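
As a rough illustration of that barrier design, the sketch below wraps a model with an input classifier that screens prompts on the way in and an output classifier that screens draft responses on the way out. It is a simplified, hypothetical rendering of the idea, not Anthropic’s implementation, and every classifier and model here is a placeholder.

```python
# Hypothetical "barrier" wrapper: screen the prompt before it reaches the
# model, then screen the model's draft response before returning it.
from typing import Callable

REFUSAL = "I can't help with that request."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],         # underlying model (placeholder)
    input_flagged: Callable[[str], bool],   # input classifier (placeholder)
    output_flagged: Callable[[str], bool],  # output classifier (placeholder)
) -> str:
    if input_flagged(prompt):
        # Block suspected jailbreak attempts before the model sees them.
        return REFUSAL
    draft = generate(prompt)
    if output_flagged(draft):
        # Block unwanted responses from getting out.
        return REFUSAL
    return draft
```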

Anthropic is now offering “$10,000 to the first person to pass all eight levels of our jailbreaking demo, and $20,000 to the first person to do so with a universal jailbreaking strategy,” with details of the rewards and conditions posted on HackerOne.
