AI Crawlers: The Nasty Bugs Causing Trouble on the Internet

AI tools with web search capabilities, such as Anthropic’s Claude, browse the internet to fetch the information users ask for. Perplexity, OpenAI, and Google offer similar capabilities through ‘Deep Research’ features.

In a blog post, Cloudflare explained that these web crawlers, often referred to as AI crawlers, deploy the same techniques as search engine crawlers to gather available information.

While the aim of AI crawlers is to assist users, they may be causing more damage on the internet than one might realise. They can drive up server resource usage for website administrators, leading to unexpected bills and, in some cases, outright disruptions.

AI Crawlers Are Becoming a Real Hassle

Gergely Orosz, creator of The Pragmatic Engineer newsletter, shared on LinkedIn, “AI crawlers are wrecking the open internet, and I’m now being hit for the bill for their training.”

He explained that his website, a side project, initially had a few thousand visitors a month and used around 100 GB of server bandwidth. But after Meta’s AI crawler and other bots like Imagesiftbot started crawling the site, bandwidth consumption jumped to more than 700 GB, adding an extra $90 to his bill.

Orosz expressed frustration over having to pay all this extra money to help train LLMs. He added that the crawlers ignore the site’s robots.txt file. “The irony is how the bots—including Meta! — blatantly ignore the robots.txt on the site that tells them ‘please stay away’…I’m upset – and have had enough.”
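For context, robots.txt is a plain-text file served at a site’s root that asks crawlers to stay away from some or all paths; honouring it is entirely voluntary. A minimal example aimed at AI crawlers might look like the sketch below (the user-agent strings are commonly published ones, but vendors’ current names should be checked against their own documentation):

```
# robots.txt: a polite request, not an enforcement mechanism
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /
```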

Vercel, a cloud platform company, shared some interesting statistics from their network in a blog post that said: “AI crawlers have become a significant presence on the web. OpenAI’s GPTBot generated 569 million requests across Vercel’s network in the past month, while Anthropic’s Claude followed with 370 million.”

Source: Vercel

“For perspective, this combined volume represents about 20% of Googlebot’s 4.5 billion requests during the same period,” it added.

Xe Iaso, a software developer, expressed frustration upon noticing that AmazonBot was consuming their Git server’s resources. Attempts to block it failed. Iaso stated in a blog post, “It’s futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more. I just want the requests to stop.”

In response, the developer created Anubis, an open-source tool that presents automated clients with a computational challenge before their requests are served, blocking crawlers that cannot or will not complete it.
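To illustrate the general idea behind such challenge systems (a rough sketch, not Anubis’s actual code), a proof-of-work check makes the client spend CPU before the server does any real work: the server hands out a random challenge, the client must find a nonce whose hash meets a difficulty target, and the server verifies the answer with a single hash. The difficulty value and function names below are illustrative assumptions:

```python
import hashlib
import secrets

DIFFICULTY = 4  # leading hex zeros required; illustrative value, not Anubis's setting


def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return secrets.token_hex(16)


def solve_challenge(challenge: str) -> int:
    """Client side: brute-force a nonce until the hash meets the target."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * DIFFICULTY):
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash to check the client's answer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)


if __name__ == "__main__":
    c = issue_challenge()
    n = solve_challenge(c)
    print("nonce:", n, "valid:", verify(c, n))
```

The asymmetry is the point: one page view costs a human visitor’s browser a fraction of a second, but a crawler issuing millions of requests pays that cost millions of times.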

The quick fix turned out to be helpful to others as well. Bart Piotrowski, a system administrator at GNOME, used it to fend off the AI crawlers that were reportedly consuming 90% of the resources of GNOME’s GitLab instance.

Drew Devault, founder of SourceHut, wrote a blog post voicing something similar: “Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale.”

Ars Technica reached a similar conclusion about AI crawlers, focusing on their impact on open-source projects. Many other reports describe people attempting to fend off AI crawlers that are consuming their web resources.

What Can Be Done?

Solutions such as Iaso’s Anubis, though not a fit for every site, are a practical option and are being adopted by a growing number of administrators.

Cloudflare has joined the fight against AI bots that do not honour the robots.txt rule with AI Labyrinth, which uses AI-generated content to keep the crawler occupied and waste its resources.

Source: Cloudflare

“Crawlers generate more than 50 billion requests to the Cloudflare network every day, or just under 1% of all web requests we see. While Cloudflare has several tools for identifying and blocking unauthorised AI crawling, we have found that blocking malicious bots can alert the attacker that you are on to them, leading to a shift in approach, and a never-ending arms race,” the Cloudflare blog read. 

It added, “So, we wanted to create a new way to thwart these unwanted bots, without letting them know they’ve been thwarted.”
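As a rough sketch of the ‘labyrinth’ idea (not Cloudflare’s implementation, which serves AI-generated text), a server can route suspected bots into an endless chain of generated pages that link only to more generated pages, so the crawler burns its crawl budget on content no human ever sees. The paths and handler below are hypothetical:

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer


def decoy_page(path: str) -> str:
    """Deterministically generate a filler page whose links lead only to
    more filler pages, so a crawler that follows them never escapes."""
    seed = hashlib.sha256(path.encode()).hexdigest()
    links = "".join(
        f'<a href="/maze/{seed[i:i + 8]}">more</a> ' for i in range(0, 40, 8)
    )
    return f"<html><body><p>Nothing to see here ({seed[:12]}).</p>{links}</body></html>"


class MazeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A real deployment would first classify the client as a suspected bot;
        # here every request simply gets a decoy page.
        body = decoy_page(self.path).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), MazeHandler).serve_forever()
```

In practice, such a maze would be shown only to clients already flagged as bots, and the decoy links would be kept invisible to human visitors and legitimate search indexing.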

In addition to the solutions mentioned above, AI companies can do their bit by building crawlers that respect websites’ resources and are less aggressive in how they gather information.

While the web search functionality in AI tools provides great value, it should not come at the cost of overwhelming the servers of small or independent web admins.
