Cloudflare Launches AI Maze: Countering Malicious Crawlers with Generative AI

AI-generated content is growing at an unprecedented rate: four of the 20 most popular posts on Facebook last fall were reportedly generated by AI, and Medium estimates that 47% of the content on its platform also comes from AI. As with all emerging tools, AI has both positive applications and the risk of being exploited for malicious purposes.

At the same time, we have observed a surge in new types of web crawlers used by AI companies to train their models. These AI crawlers make more than 50 billion requests per day to the Cloudflare network, accounting for nearly 1% of our network's total requests. While Cloudflare already has tools to identify and block unauthorized AI crawling, practice has shown that blocking a malicious Bot outright can alert the attacker and prompt a change in tactics, leading to an endless game of cat and mouse. Cloudflare wanted a new way to thwart these unwanted Bots without tipping them off.


To achieve this, Cloudflare decided to adopt a strategy that is a common tool for Bot creators, but has rarely been applied in the defense space before: AI-generated content. When Cloudflare detects an unauthorized crawl, instead of blocking the request outright, it links to a series of AI-generated pages. The content of these pages is realistic enough to entice the crawler to continue crawling deeper. However, the seemingly real content is not actually the real data on the site itself, allowing the crawler to waste time and computing resources on false information.

The "AI Maze" has the added advantage of being a new generation of honeypot technology. Because a normal human user would never delve several links deep into an AI-generated maze of meaningless pages, any visitor that does is most likely a Bot. This gives Cloudflare a new tool for identifying and tagging malicious Bots and adding them to the list of known malicious actors. Here's exactly how the "AI Maze" works.

 

How the Maze Was Constructed

When AI crawlers follow these links, they consume valuable computing resources to process irrelevant content instead of crawling the site for real data. This greatly diminishes the ability of AI crawlers to efficiently gather information to train models.

To generate human-like content that is as convincing as possible, Cloudflare used Workers AI with an open source model to create unique HTML pages on a variety of topics. To avoid the performance impact of generating content on demand, Cloudflare implemented a pre-generation process, sanitized the content to prevent XSS vulnerabilities, and stored it in R2 for quick retrieval. Cloudflare found that first identifying a variety of topics and then generating content for each topic produced more varied and convincing results. Cloudflare is also careful to avoid generating inaccurate content that would contribute to the spread of disinformation on the Internet. As a result, the content generated by the "AI Maze" is real and grounded in scientific facts, but it is unrelated to the content of the website being crawled.
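The pre-generation step described above can be sketched roughly as follows. This is a minimal illustration, not Cloudflare's implementation: the `generate` callback stands in for a call to a text model, the plain dictionary stands in for an object store such as R2, and the function names and the `maze/` key layout are hypothetical. Escaping the model output with Python's standard `html.escape` is one simple way to make sure any markup the model emits is rendered as inert text.

```python
import hashlib
import html
from typing import Callable, Dict, List


def render_page(topic: str, body_text: str) -> str:
    """Wrap generated text in a minimal HTML page.

    Both the topic and the body are escaped, so any tags or scripts in the
    model output are displayed as text instead of being interpreted.
    """
    safe_topic = html.escape(topic)
    safe_body = html.escape(body_text)
    return (
        "<!doctype html><html><head>"
        f"<title>{safe_topic}</title>"
        "</head><body>"
        f"<h1>{safe_topic}</h1><p>{safe_body}</p>"
        "</body></html>"
    )


def pregenerate(
    topics: List[str],
    generate: Callable[[str], str],  # hypothetical stand-in for a model call
    store: Dict[str, str],           # hypothetical stand-in for an object store
) -> List[str]:
    """Generate one page per topic ahead of time and store it under a stable key."""
    keys = []
    for topic in topics:
        page = render_page(topic, generate(topic))
        # A hash of the topic gives a stable, URL-safe key (assumed layout).
        digest = hashlib.sha256(topic.encode("utf-8")).hexdigest()[:16]
        key = f"maze/{digest}.html"
        store[key] = page
        keys.append(key)
    return keys
```

Because the pages are produced ahead of time, serving a decoy at request time is only a key lookup, which is the performance benefit the article attributes to pre-generation.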

This pre-generated content is woven into existing pages as hidden links through Cloudflare's custom HTML transformation process, without disrupting the original structure and content of the page. Each generated page includes appropriate meta directives to protect SEO by preventing search engines from indexing it, and Cloudflare ensures that the links stay invisible to human visitors through carefully chosen attributes and styling. To further minimize the impact on normal visitors, these links are added only to pages served to suspected AI crawlers, while legitimate users and verified crawlers continue to browse the site normally.
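As a rough illustration of the idea, the sketch below adds a hidden link to a page only when the request is suspected to come from a crawler, and adds a robots meta directive to a decoy page. The article does not say which attributes or styles Cloudflare uses, so the ones here (`rel="nofollow"`, `aria-hidden`, `tabindex="-1"`, `display:none`) are assumptions, as are the function names. The regex-based insertion is a simplification; a production system would use a real HTML rewriter.

```python
import html
import re

# Standard directive asking search engines not to index or follow the page.
ROBOTS_META = '<meta name="robots" content="noindex, nofollow">'

_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)
_HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def hidden_link(href: str) -> str:
    """Build an anchor that browsers do not render or focus (assumed attributes)."""
    safe_href = html.escape(href, quote=True)
    return (
        f'<a href="{safe_href}" rel="nofollow" aria-hidden="true" '
        'tabindex="-1" style="display:none"></a>'
    )


def inject_hidden_link(page_html: str, href: str, suspected_crawler: bool) -> str:
    """Insert the hidden link just before the last </body>, for suspected crawlers only."""
    if not suspected_crawler:
        return page_html  # legitimate visitors get the page untouched
    matches = list(_BODY_CLOSE.finditer(page_html))
    if not matches:
        return page_html  # no closing body tag found; leave the page as is
    position = matches[-1].start()
    return page_html[:position] + hidden_link(href) + page_html[position:]


def add_robots_meta(decoy_html: str) -> str:
    """Place the robots directive right after the opening <head> tag of a decoy page."""
    match = _HEAD_OPEN.search(decoy_html)
    if match is None:
        return decoy_html
    return decoy_html[: match.end()] + ROBOTS_META + decoy_html[match.end():]
```

The original markup is left byte-for-byte intact apart from the inserted anchor, which mirrors the article's point that the page's structure and content are not disturbed.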

Figure: Daily requests from AI crawlers, compared by crawler category.

What makes this approach effective is the role it plays in Cloudflare's evolving bot detection system. When these hidden links are visited, Cloudflare can be highly confident that this is not the behavior of a human user, but rather an automated crawler in action, since human visitors and normal browsers will never see or click on these links. This provides Cloudflare with a powerful identification mechanism that generates valuable data that feeds into machine learning models. By analyzing which crawlers are following these hidden paths, Cloudflare can identify new Bot patterns and features that might otherwise be difficult to detect. This proactive approach helps Cloudflare stay ahead of the curve in the AI crawler space, continually improving detection capabilities without disrupting the normal browsing experience for users.
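The detection signal described here is simple to express in code. The sketch below is only an illustration under stated assumptions: the `/maze/` path prefix, the `HoneypotTracker` class, and the notion of a `client_id` are all hypothetical, and a real system would feed these observations into a scoring model rather than keep them in memory.

```python
from collections import defaultdict
from typing import DefaultDict, Set

MAZE_PREFIX = "/maze/"  # hypothetical path prefix under which decoy pages live


class HoneypotTracker:
    """Record which clients request decoy paths that humans never see."""

    def __init__(self, prefix: str = MAZE_PREFIX) -> None:
        self.prefix = prefix
        self.hits: DefaultDict[str, Set[str]] = defaultdict(set)

    def observe(self, client_id: str, path: str) -> bool:
        """Record one request; return True if it touched a decoy path."""
        if path.startswith(self.prefix):
            self.hits[client_id].add(path)
            return True
        return False

    def is_flagged(self, client_id: str) -> bool:
        """A single decoy hit is treated as strong evidence of automation."""
        return client_id in self.hits

    def depth(self, client_id: str) -> int:
        """Number of distinct decoy pages the client has requested."""
        return len(self.hits.get(client_id, ()))
```

The `depth` value is the kind of feature that could be passed on to a detection model: a client that keeps following decoy links is behaving like the persistent crawler the article describes.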

By building this solution on the Cloudflare Developer Platform, Cloudflare creates a system that delivers realistic decoy content instantly and maintains consistent content quality without impacting site performance or user experience.

 

How to Stop AI Crawlers with AI Maze

Enabling the AI Maze is as simple as flipping a switch in the Cloudflare dashboard. Navigate to the Bot Management section for your zone, and then toggle the new "AI Maze" setting to "On".

Once enabled, the AI Maze will start working immediately without any additional configuration.

 

AI honeypot, created by AI

The core strength of the "AI Maze" is that it confuses and distracts Bots. Another important benefit is that it acts as a next-generation honeypot. In this context, honeypots are hidden links that are invisible to website visitors, but which a Bot parsing the HTML will find and follow, exposing itself. Honeypot techniques have been used to catch hackers since at least the "Cuckoo's Egg" incident in 1986. In 2004, before founding Cloudflare, its founders created Project Honeypot, which lets anyone deploy free email honeypots and receive lists of crawler IPs in exchange for contributing to the database. But as Bots have evolved, they now actively look for honeypot techniques such as hidden links, which makes traditional honeypots less effective.

The "AI Maze" not only adds hidden links, but eventually creates entire networks of interlinked URLs that are more realistic and difficult for automated programs to recognize. The content on the page is obviously not something a human would spend time browsing, but the AI Bot is programmed to crawl deeply to gather as much data as possible. When the Bot visits these URLs, Cloudflare can be confident that they are not real human users, and this information is recorded and automatically fed into Cloudflare's machine learning models to help improve Bot recognition. This creates a virtuous cycle where each crawl attempt helps protect all Cloudflare customers.
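To make the idea of a network of interlinked decoy URLs concrete, here is a small sketch that builds such a network deterministically and simulates a crawler following it. Nothing here reflects how Cloudflare actually lays out its maze: the URL pattern, the hash-based choice of link targets, and the breadth-first crawler are assumptions chosen for illustration.

```python
import hashlib
from collections import deque
from typing import Dict, List


def build_maze(page_count: int, links_per_page: int) -> Dict[str, List[str]]:
    """Return a mapping from each decoy URL to the decoy URLs it links to."""
    if page_count < 2 or not 1 <= links_per_page < page_count:
        raise ValueError("need at least 2 pages and 1 <= links_per_page < page_count")
    urls = [f"/maze/page-{i}.html" for i in range(page_count)]  # hypothetical pattern
    maze: Dict[str, List[str]] = {}
    for i, url in enumerate(urls):
        # Derive a stable starting point from the URL so the layout is reproducible.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        start = int.from_bytes(digest[:4], "big") % page_count
        targets: List[str] = []
        offset = 0
        while len(targets) < links_per_page:
            candidate = (start + offset) % page_count
            offset += 1
            if candidate != i:  # a page never links to itself
                targets.append(urls[candidate])
        maze[url] = targets
    return maze


def crawl(maze: Dict[str, List[str]], start: str, max_requests: int) -> List[str]:
    """Simulate a breadth-first crawler that follows every link it finds."""
    visited: List[str] = []
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < max_requests:
        url = queue.popleft()
        visited.append(url)
        for target in maze.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited
```

Every link in the generated network points to another decoy page, so a crawler that enters at any page spends its whole request budget inside the maze and never reaches the site's real content.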

 

Future Outlook

This is just Cloudflare's first attempt at countering Bots with generative AI. For now, while the "AI Maze" generates enough content to confuse a Bot, the generated pages may not fit perfectly into the existing structure of each website. Going forward, Cloudflare will continue working to make these links harder to detect and to make them fit seamlessly into the structure of the websites they are embedded in. Users can help Cloudflare improve this feature by opting in to the "AI Maze" now.

To take the next step in the fight against Bots, enable the AI Maze now.

May not be reproduced without permission: Chief AI Sharing Circle » Cloudflare Launches AI Maze: Countering Malicious Crawlers with Generative AI