In recent years, with the rapid development of Generative AI (GAI) and large language models (LLMs), their security and reliability have attracted growing attention. A recent study introduced a simple but effective attack on LLMs known as Best-of-N Jailbreaking (BoN for short). By randomly morphing the input and making repeated attempts, the researchers bypassed the safety restrictions of several mainstream AI systems and forced them to generate harmful or inappropriate content. Strikingly, the attack success rate of this method ranges from 70% to 90%, exposing significant vulnerabilities in current AI safety mechanisms.
Core principles of the BoN approach
The core idea of the Best-of-N jailbreak is to find inputs that break through the system's safety constraints by repeatedly applying small deformations to the input (text, images, or audio), without any knowledge of the AI model's internal structure. These deformations include:
- Text input: randomly change letter case, reorder characters, or insert symbols.
- Image input: change the background or overlay text in different fonts.
- Audio input: adjust the pitch or playback speed, or add background noise.
These deformations do not change the core semantics of the input, but in some cases they can trigger an unsafe response from the AI system. Through repeated attempts, the researchers were able to find an "optimal morph" of the input that induced the model to generate content violating its safety rules.
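To make the loop concrete, here is a minimal sketch of how such an attack could be structured. The `query_model` and `is_harmful` functions are hypothetical placeholders rather than anything from the study, and the augmentations shown cover only the text-level deformations described above.

```python
import random

# Hypothetical placeholders: in practice these would wrap the target model's
# API and some judge that flags policy-violating output.
def query_model(prompt: str) -> str:
    raise NotImplementedError("call the target model here")

def is_harmful(response: str) -> bool:
    raise NotImplementedError("call a safety judge/classifier here")

def augment_text(prompt: str, p: float = 0.3) -> str:
    """Apply small random deformations: case flips, adjacent-character swaps, noise symbols."""
    chars = list(prompt)
    # Randomly flip the case of individual letters.
    chars = [c.swapcase() if c.isalpha() and random.random() < p else c for c in chars]
    # Randomly swap a few adjacent characters.
    if len(chars) > 1:
        for _ in range(max(1, int(len(chars) * p * 0.2))):
            i = random.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    # Occasionally insert a harmless symbol.
    if random.random() < p:
        chars.insert(random.randint(0, len(chars)), random.choice("*~!-"))
    return "".join(chars)

def best_of_n_jailbreak(prompt: str, n: int = 1000):
    """Resample augmented prompts until one elicits a harmful response, or give up after n tries."""
    for attempt in range(1, n + 1):
        candidate = augment_text(prompt)
        if is_harmful(query_model(candidate)):
            return attempt, candidate  # success: return the attempt count and the winning morph
    return None  # no success within the budget
```

In effect, the attacker simply resamples cheap random variations until one slips past the safety filter, which is why no white-box access to the model is needed.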
Attack Success and Coverage
The research team tested several mainstream AI systems using the BoN methodology, and the results showed that these systems are generally highly vulnerable to such simple attacks. For example:
- GPT-4: 89% of the morphing attacks bypassed its safety restrictions.
- Claude 3.5: 78% of the morphing attacks succeeded.
- Other mainstream systems: similar vulnerabilities were observed, although the exact success rate varies from system to system.
The success rate of BoN attacks is significantly higher than that of traditional jailbreak techniques, thanks to the method's randomness and scalability. It is particularly effective against generative AI tasks, suggesting that the safety design of current AI systems lacks sufficient robustness.
Attack Scale and Predictability
The study also shows that the attack success rate grows with the number of attempts (the N in Best-of-N) according to a power law. This means that, by increasing the number of samples or the available compute, the BoN method can reach higher success rates, and this regularity lets researchers predict the success rate for a given model and adjust the attack strategy accordingly. For example:
- For GPT-4, increasing the number of attempts by 20% raises the success rate by about 10%.
- Combining BoN with other attack techniques further increases the success rate while reducing the number of samples and attempts required.
This shows that the BoN method is not only efficient but also scales well, and that combining it with other attack methods yields faster and more reliable breakthroughs. The power-law behavior also makes forecasting straightforward, as sketched below.
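As a rough illustration of how such forecasting might work, the sketch below fits a power-law curve to hypothetical success-rate measurements at small N and extrapolates to a larger budget. The functional form, the numbers, and the `predict_asr` helper are illustrative assumptions, not figures from the study.

```python
import numpy as np

# One common way to model this kind of scaling is
#   -log(ASR(N)) ~ a * N**(-b)
# i.e. the negative log of the attack success rate falls as a power law in N.

# Hypothetical measurements: success rates observed at small attempt budgets.
N_obs   = np.array([10, 30, 100, 300])
asr_obs = np.array([0.12, 0.25, 0.45, 0.62])

# Fit a and b with a linear regression in log-log space:
#   log(-log ASR) = log a - b * log N
x = np.log(N_obs)
y = np.log(-np.log(asr_obs))
slope, log_a = np.polyfit(x, y, deg=1)
a, b = np.exp(log_a), -slope

def predict_asr(n: int) -> float:
    """Extrapolate the attack success rate to a larger attempt budget n."""
    return float(np.exp(-a * n ** (-b)))

print(predict_asr(10_000))  # forecast ASR at a much larger N
```

Fitting on a handful of cheap, small-N runs and then extrapolating is what makes the attack budget for a specific model easy to plan in advance.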
Why does the BoN approach work?
The researchers noted that the success of the BoN approach is largely due to the following factors:
- Input deformation disturbs the model's safety assessment mechanism
Most AI systems rely on specific input features or patterns to identify potential threats. BoN's random deformations corrupt these features, making it more likely that the system misclassifies a harmful input as safe.
- Black-box nature of interaction with models
BoN does not rely on knowledge of the AI model's internal mechanisms; it only requires external interaction with the system. This makes the attack easy to carry out in practice.
- Cross-modal applicability
The method applies not only to text input but also to vision-language models and speech-processing models. For example, by overlaying text on an image or adjusting the attributes of an audio file, the researchers successfully bypassed the safety rules of these systems.
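The sketch below illustrates these cross-modal deformations in the simplest possible form, overlaying text on an image with Pillow and retiming an audio waveform with NumPy; the specific positions, speeds, and noise levels are assumptions for illustration and are not taken from the study.

```python
import numpy as np
from PIL import Image, ImageDraw

def augment_image(path: str, overlay: str) -> Image.Image:
    """Overlay the request text onto an image; fonts, colors, and placement can be varied."""
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    # Uses PIL's default bitmap font; a real attack would also vary font, size, and color.
    draw.text((10, 10), overlay, fill=(255, 0, 0))
    return img

def augment_audio(samples: np.ndarray, speed: float = 1.1) -> np.ndarray:
    """Change playback speed by resampling a 1-D waveform and add light background noise."""
    old_idx = np.arange(len(samples))
    new_idx = np.linspace(0, len(samples) - 1, int(len(samples) / speed))
    resampled = np.interp(new_idx, old_idx, samples)
    noise = np.random.normal(0.0, 0.005, size=resampled.shape)
    return resampled + noise
```

Because the same resample-and-retry loop works regardless of modality, a single black-box attack harness can target text, vision, and speech interfaces alike.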
AI security wake-up call
The emergence of the Best-of-N jailbreak is a wake-up call for AI system security. Although the method appears simple, its strikingly high success rates show that the current safety protections of AI systems remain fragile in the face of non-traditional threats.
The researchers suggest that AI developers need to strengthen their security in the following ways:
- Enhanced input robustness
Design input-validation mechanisms that remain strict under random deformations, reducing the likelihood that the system is bypassed (a sketch appears after this list).
- Joint multimodal protection
Combine text, visual, and audio assessments to improve the overall security of the system.
- Introduction of more advanced testing mechanisms
Automatically identify potentially malicious inputs using Generative Adversarial Networks (GANs) or other advanced techniques.
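As one possible interpretation of the "enhanced input robustness" suggestion, the sketch below canonicalizes text before running an existing safety check, so that cheap case, symbol, and whitespace tricks are undone. Here `safety_classifier` is a hypothetical placeholder for whatever check the system already uses.

```python
import re
import unicodedata

# Hypothetical placeholder for the system's existing safety check.
def safety_classifier(text: str) -> bool:
    raise NotImplementedError("plug in the existing safety check here")

def canonicalize(text: str) -> str:
    """Undo the cheap text deformations BoN relies on before running safety checks."""
    text = unicodedata.normalize("NFKC", text)   # fold compatibility characters
    text = text.lower()                          # remove case tricks
    text = re.sub(r"[^\w\s]", "", text)          # strip injected symbols
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace noise
    return text

def robust_is_harmful(text: str) -> bool:
    """Flag the input if either the raw or the canonicalized form looks harmful."""
    return safety_classifier(text) or safety_classifier(canonicalize(text))
```

This only addresses text-level deformations; analogous normalization (for example, OCR on overlaid image text or resampling audio to a canonical rate) would be needed for the other modalities.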
Concluding remarks
The research on the Best-of-N jailbreak reminds us that AI development must not only pursue performance gains but also squarely address security and reliability. Only by continuously improving protection mechanisms can AI systems achieve safe, reliable, and widespread application.