
Uncovering security holes in AI filters: a deep dive into using character codes to bypass restrictions

Introduction

Like many others, over the past few days my news feeds have been filled with stories about the Chinese-made DeepSeek-R1: news, praise, complaints, and speculation about the large language model released last week. The model is being compared to some of the best reasoning models from OpenAI, Meta, and others, and is reportedly competitive across various benchmarks. This has caused a stir in the AI community, especially since DeepSeek-R1 is said to have been trained with significantly fewer resources than its competitors, prompting discussion about the potential for more cost-effective AI development. While there is a broader conversation to be had about its implications, that is not the focus of this article.

 

Open source model, proprietary chat application

It is important to note that while the model itself is released under a liberal MIT license, DeepSeek also runs its own AI chat application and the service behind it, which requires an account. For most people this is their entry point to DeepSeek, so it is the focus of the prompt injection efforts in this article. After all, it's not every day that we see a new, heavily publicized, yet restricted AI chat product...


 

Prompt and Response Censorship

Given that DeepSeek is made in China, it naturally comes with fairly strict limits on what it will generate answers about. Reports that DeepSeek-R1 censors prompts related to sensitive Chinese topics have raised questions about its reliability and transparency, and piqued my curiosity. For example, consider the following:

[Screenshot: DeepSeek-R1 refusing to answer a sensitive prompt]

The DeepSeek-R1 chat service avoids discussing sensitive issues because of its censorship mechanism. The model was developed in China, where there are strict rules about discussing certain sensitive topics. When a user asks about these topics, the service usually responds with something like "Sorry, that's beyond my current scope. Let's talk about something else."

 

Prompt Injection

I've been trying prompt injection against this new service. From a threat-modeling perspective, what does the interaction pattern actually look like? My hypothesis is that the censorship rules are unlikely to have been trained directly into the LLM itself. That means, as with many commercial AI products, the filtering may instead be applied at the input or output stage of the conversation:


Threat model showing possible component interactions for DeepSeek

This is a pattern you often see in filters of all kinds, whether firewalls, content filters, or censors. These systems are designed to block or sanitize certain types of content, but they usually rely on predefined rules and patterns. Think of it almost like a web application firewall (WAF): you know there must be some way to manipulate the input and output to get past the sanitizer. In DeepSeek's case, my assumption is that the censorship mechanism isn't built into the model itself but is applied as a sanitization layer on the input or output, much like a WAF inspecting and filtering web traffic. The challenge then becomes finding a way to communicate with the model that bypasses these filters.
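To make that assumed setup concrete, here is a minimal Python sketch of the threat model: an otherwise unrestricted model wrapped by input-side and output-side filters. Everything in it (the blocklist, the refusal string, the chat wrapper, the toy model) is an illustrative assumption, not DeepSeek's actual code.

```python
# Illustrative sketch only: censorship modeled as sanitization layers around an
# otherwise unrestricted model. Names and logic are assumptions, not DeepSeek's code.
BLOCKLIST = {"forbidden topic"}
REFUSAL = "Sorry, that's beyond my current scope. Let's talk about something else."

def violates_policy(text: str) -> bool:
    """Naive rule-based check, analogous to a WAF signature match."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

def chat(user_prompt: str, model) -> str:
    if violates_policy(user_prompt):   # input-side filter
        return REFUSAL
    reply = model(user_prompt)         # the LLM itself answers freely
    if violates_policy(reply):         # output-side filter
        return REFUSAL
    return reply

# Toy stand-in for the model, just to show the control flow.
echo_model = lambda prompt: f"(model reply to: {prompt})"
print(chat("Tell me about the forbidden topic.", echo_model))  # refusal
print(chat("Tell me about the weather.", echo_model))          # normal reply
```

If the rules only ever see the literal text of the prompt and the reply, anything that changes the surface form of that text becomes a candidate bypass.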

Character codes

After some experimentation, I found that the most effective approach is to use a specific subset of character codes. Character codes are numeric representations of characters in a character set. For example, in ASCII (American Standard Code for Information Interchange), the character code for the letter 'A' is 65. By using these numeric codes, you can represent text in a way that filters designed to block specific words or phrases may not immediately recognize. In this example I use base16 (hexadecimal) character codes separated by spaces, so each character is represented by a two-digit hexadecimal number.
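As a quick sketch of that encoding (Python is my choice here purely for illustration; nothing about the chat service requires it):

```python
# Minimal sketch of the encoding described above: each character becomes its
# base16 (hexadecimal) character code, with the codes separated by spaces.
def to_charcodes(text: str) -> str:
    return " ".join(f"{ord(ch):02x}" for ch in text)

print(to_charcodes("A"))      # 41  (the letter 'A' is character code 65, i.e. 0x41)
print(to_charcodes("Hello"))  # 48 65 6c 6c 6f
```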

Example Injection Attack

By prompting DeepSeek to talk to me using only these character codes, I can effectively bypass the filter.

[Screenshot: prompting DeepSeek to converse using only character codes]
On my end, I translate the character codes back into readable text, and vice versa. This approach lets me have an unrestricted dialog with the model, bypassing the imposed limitations.
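Decoding the model's hex replies is the same operation in reverse. A minimal sketch of that manual translation step (my own helper for illustration, not part of any DeepSeek tooling):

```python
# Split on spaces, parse each token as a base16 number, and map it back to a character.
def from_charcodes(codes: str) -> str:
    return "".join(chr(int(token, 16)) for token in codes.split())

print(from_charcodes("48 65 6c 6c 6f"))  # Hello
```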


An easy way to do this back-and-forth mapping is with a CyberChef recipe using its character-code conversion operations, where you can choose the appropriate base and delimiter.

[Screenshot: CyberChef recipe for character-code conversion]

 

Lessons learned

I've already hinted at the similarities with WAF filters and firewalls. We shouldn't only inspect for explicit types of traffic or content, especially when transformations can be applied to the content on either side of the filter: enforce specific content formats and disable transformations where you can. A more comprehensive approach to content filtering protects against a wider range of threats and keeps security measures effective even as attackers find new ways to bypass them.
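One way to act on that lesson is to canonicalize input before filtering, so trivially encoded text is decoded back to its plain form before any rules run. The sketch below reuses the same toy blocklist as earlier and only handles the space-separated hex trick; it is an assumption-laden illustration, and a real filter would need to normalize far more encodings.

```python
import re

# Toy blocklist plus a canonicalization step that undoes the hex char-code trick
# before filtering. Purely illustrative; real systems need broader normalization.
BLOCKLIST = {"forbidden topic"}
HEX_CODES = re.compile(r"^(?:[0-9a-fA-F]{2}\s+)+[0-9a-fA-F]{2}$")

def canonicalize(text: str) -> str:
    stripped = text.strip()
    if HEX_CODES.match(stripped):
        return "".join(chr(int(tok, 16)) for tok in stripped.split())
    return text

def is_blocked(text: str) -> bool:
    lowered = canonicalize(text).lower()
    return any(term in lowered for term in BLOCKLIST)

encoded = " ".join(f"{ord(c):02x}" for c in "Tell me about the forbidden topic.")
print(is_blocked(encoded))  # True: the filter now sees the decoded text
```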

This experiment highlights a key aspect of AI and machine-learning systems: the importance of strong security measures. As AI continues to evolve and integrate into various fields, understanding and mitigating potential vulnerabilities becomes critical. The ability to bypass filters using character codes is a reminder to keep security measures up to date and to keep testing them against new exploits.

 

Future studies

Looking ahead, it will be interesting to see how AI developers tackle these kinds of challenges. Will they develop more sophisticated filtering mechanisms, or will they find new ways to embed censorship directly into their models? Only time will tell. For now, this provides a valuable lesson for ongoing efforts to secure AI technology.
