Anthropic Releases Rule Classifier: Effective Defense Against Jailbreak Attacks on Large Language Models, Participate in Tests for Bonuses!

1.4K 00

With the rapid development of artificial intelligence technology, large-scale language models (LLMs) are changing our lives at an unprecedented rate. However, advances in technology have also brought new challenges: LLMs can be maliciously exploited to leak harmful information or even be used to create chemical, biological, radiological, and nuclear weapons (CBRN). To address these threats, the Anthropic security research team has come up with an innovative defense mechanism - theRule Classifiers (Constitutional Classifiers)This research aims to defend against "generic jailbreak" attacks that systematically bypass model security by training classifiers based on natural language rules. This research aims to defend against "generic jailbreak" attacks that systematically bypass model security by training classifiers based on natural language rules.

--Constitutional Classifiers: Defense Against Generic Jailbreak Attacks

Anthropic Safeguards Research Team A new paper has been released , describes a method for defending AI models against generic jailbreak attacks. A prototype version of the method was robust to generic jailbreak attacks in thousands of hours of manual red team testing, albeit with a high over-rejection rate and computational overhead. An updated version achieves similar robustness in synthetic evaluations and has a denial rate increase of only 0.38% with a moderate increase in computational cost.

The Anthropic team is currently hosting a temporary online demo version of the Constitutional Classifiers system, and encourages readers with experience in jailbreaking AI systems to help "red team test" it. Please click Demo Site Learn more.

Anthropic 发布规则分类器：有效防御大语言模型越狱攻击，参与测试领奖金！

Update February 5, 2025: The Anthropic team is now offering a $10,000 reward for the first person to demo all eight levels via a jailbreak, and a $20,000 reward for the first person to do so using a universal jailbreak strategy. For full details on the rewards and associated conditions, please visit HackerOneThe

Large language models undergo extensive security training to prevent harmful outputs. For example, Anthropic trains Claude Refuse to respond to user inquiries involving the production of biological or chemical weapons.

However, the model remains vulnerable to jailbreak (an iOS device etc) Attacks: inputs designed to bypass its security guards and force a harmful response. Some jailbreak attacks use Very long multi-sample prompts flooding models; others modify Input style , for example uSiNg uNuSuAl cApItALiZaTiOn this unusual case pattern. Historically, jailbreak attacks have proved difficult to detect and stop: such attacks have been It's been described more than 10 years ago. But as far as Anthropic knows, there are still no fully robust deep learning models in production environments.

Anthropic is developing better jailbreak defenses so that increasingly powerful models can be safely deployed in the future. In accordance with Anthropic's Responsible Scaling Strategy, Anthropic can deploy such models as long as Anthropic is able to mitigate the risk to an acceptable level with appropriate security measures - but jailbreak attacks allow users to bypass these security measures. In particular, Anthropic expects that systems defended by Constitutional Classifiers will enable Anthropic to mitigate the risk of jailbreaking models that have passed the CBRN capability thresholds outlined in Anthropic's Responsible Extension Strategy ^1^.

exist Anthropic's new paper In it, Anthropic describes a program based on the Constitutional Classifiers of a system that protects models from jailbreak attacks. These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter out the vast majority of jailbreak attacks while minimizing over-rejection and without incurring significant computational overhead.

Results from manual red team testing

Anthropic conducted two main types of tests to evaluate the effectiveness of Constitutional Classifiers.

First, Anthropic developed a prototype version of the system to identify and block specific scientific knowledge related to chemical, biological, radiological, and nuclear hazards. Anthropic then invited independent jailbreakers to participate in a bug bounty program in which they were asked to "red team test" the system (i.e., try to crack the system under experimental conditions to test its robustness).

Specifically, they were given a list of ten "forbidden" queries, and their task was to use whatever jailbreaking techniques they wanted to make one of Anthropic's current models (in this case Claude 3.5 Sonnet, June 2024) answer all of them under the protection of the prototype version of Constitutional Classifiers. protection of the Constitutional Classifiers prototype version, answer all queries. This is only possible if the model has a good understanding of the possess Anthropic considered it a successful "generic" jailbreak when the queries provided detailed answers.

During the two-month experiment, 183 active ^2^ participants spent an estimated 3,000+ hours attempting to jailbreak the model. If they found a generic jailbreak, they would receive a monetary reward of up to $15,000. despite the effort, none of the participants were able to induce the model to answer all ten prohibited queries with a single jailbreak - i.e., no generic jailbreaks were found.

Although this prototype system is robust to jailbreak attacks, it has some problems: it rejects too many harmless queries and is expensive to run. A system with these two problems might be more secure, but it might also not be easy to deploy. Since then, Anthropic has improved Anthropic's system to achieve a similar level of robustness while greatly reducing these problems.

Results from automated assessments

Anthropic ran a set of automated evaluations in which Anthropic synthetically generated 10,000 jailbreak hints, including many of the attacks that are most effective against the current large language model, as well as attacks designed to circumvent classifier security measures.Anthropic attempted these hints on the Constitutional Classifiers-protected Claude 3.5 Sonnet version (October 2024) protected by Constitutional Classifiers, and on a version of Claude not protected by classifiers.

In the baseline condition, the jailbreak success rate without defensive classifiers was 86% - that is, Claude itself blocked only 14% of these advanced jailbreak attempts. However, protecting Claude with Constitutional Classifiers produces a significant improvement: the jailbreak success rate drops to 4.4%, which means that more than 95% of jailbreak attempts are rejected.

Anthropic 发布规则分类器：有效防御大语言模型越狱攻击，参与测试领奖金！
Figure 1: Results from automated assessments. For all graphs, the lower the value, the better. (a) the success rate of jailbreaks is much lower on Constitutional Classifiers-protected systems; (b) there is no statistically significant increase in the rejection rate of systems on production Claude.ai Free and Pro traffic when using Constitutional Classifiers; (c) the relative computational cost for systems using (c) The relative computational cost of the system is only moderately higher with Constitutional Classifiers. Error bars represent 95% confidence intervals calculated using binomial proportional standard errors under the assumption of asymptotic normality.

At the same time, Anthropic's Constitutional Classifiers do not substantially increase the rate of excessive rejections of innocuous user queries: there is a slight increase in Claude's rejection rate with the classifiers (an increase of 0.381 TP3T), but this is not statistically significant in a random sample of 5,000 conversations. Finally, there is a moderate increase in computational cost (an increase of 23.71 TP3T), which is higher than the unprotected model.Anthropic is working to further reduce rejection and computational cost as Anthropic improves this technique.

Overall, Anthropic's automated analysis found that this updated version of the Constitutional Classifiers system significantly improves the robustness of AI models against jailbreak attacks - and does so at a minimal additional cost.

Working Principle

Constitutional Classifiers is based on the same principles as Constitutional AI In a similar process, Constitutional AI is another technique Anthropic uses to align Claude. Both techniques use a charter: a list of principles to which the model should adhere. In the case of Constitutional Classifiers, these principles define permissible and impermissible content categories (e.g., recipes for mustard are permissible, but recipes for mustard gas are not).

With the help of Claude, Anthropic used this charter to generate a large number of synthetic hints and synthetic model complements across all content categories.Anthropic expanded these hints and complements to ensure that the list was diverse: this included translating them into different languages and converting them to be written in a known jailbreak style.

Anthropic 发布规则分类器：有效防御大语言模型越狱攻击，参与测试领奖金！
Figure 2: Training and implementation of Constitutional Classifiers.(a) Charters are developed to specify harmless and harmful categories; (b) Charters are used as the basis for generating a large number of synthetic cues and complements that are further expanded (stylistic and linguistic variations) and transformed into a training set; (c) Classifiers trained on this training set are used as models security measures to detect and block harmful content.

Anthropic then uses this synthetic data to train Anthropic's input and output classifiers to flag (and block) potentially harmful content based on a given charter. To help minimize over-rejection (i.e., incorrectly labeling harmless content as harmful), Anthropic also trains the classifiers on a fixed set of benign queries generated by contractors.

limitations

Constitutional Classifiers may not be able to stop every generic jailbreak, but Anthropic believes that even the small number of jailbreak attacks that make it past Anthropic's classifiers will require more effort to detect when security measures are used. It is also possible that new jailbreaking techniques may be developed in the future that are effective on this system; therefore, Anthropic recommends the use of complementary defenses. However, the charter used to train the classifier can be quickly adapted to cover new types of attacks as they are discovered.

The full paper contains all the details about the Constitutional Classifiers method and the classifiers themselves.

Constitutional Classifiers Online Demo

Want to test Claude with the Red Team for yourself? Anthropic invites you to try a demo of Anthropic's Constitutional-Classifiers-protected system and to try jailbreaking Claude's 3.5 Sonnet version, which is protected by Anthropic's new technology.

While Constitutional Classifiers technology is flexible enough to adapt to any topic, Anthropic chose to focus its presentation on queries related to chemical weapons.

Challenging users to attempt to jailbreak Anthropic's products serves an important security purpose: Anthropic wanted to stress-test Anthropic's systems under real-world conditions, beyond the scope of the testing Anthropic did for the paper. This allows Anthropic to collect more data and improve the robustness of the method before deploying it to Anthropic's production systems in the future.

demonstrations will be published in February 3, 2025 until February 10, 2025 Anthropic will make it available online during this period. It includes a feedback form where you can contact Anthropic to report any successful jailbreaks, as well as information on Anthropic's Responsible Disclosure Policy, which Anthropic requires participants to adhere to.Anthropic will announce any successes and general results of the demo in updates to this post.

Update February 5, 2025: As mentioned above, Anthropic is now offering a monetary reward for successfully jailbreaking the Anthropic system. the first person to pass all eight levels of the jailbreak demo will win $10,000 USD. The first person to pass all eight levels using the universal jailbreak strategy will win $20,000. For complete details on the rewards and associated conditions, visit HackerOne.