The Cloudflare-Perplexity Clash Over Web Crawling

The heated dispute between cloud infrastructure giant Cloudflare and Perplexity, the AI search application company, over the latter’s web crawling practices has raised concerns about AI applications scraping content without authorisation.

On August 4, Cloudflare published a blog post accusing Perplexity of deceptive web crawling, the practice in which automated systems browse and index website content. It alleged that the AI company disguised its identity to get around blocks on sites that had explicitly chosen to keep such crawlers out.

However, Perplexity disagrees and claims that Cloudflare is misunderstanding the issue. It argues that mass-scale crawling is not the same as an AI agent retrieving information from a website on behalf of a user.

But in a statement to AIM, Cloudflare reaffirmed its position, stating that content creators should have control over how their material is accessed, and accused Perplexity of undermining this fundamental right.

For context, Perplexity’s search features work by gathering information from websites across the internet to generate an answer to a user query with relevant citations. However, website owners can establish rules to prevent AI bots from accessing or crawling their content, or certain parts of it.

Cloudflare’s initial investigation found that Perplexity employed systematic methods to bypass these restrictions: rotating IP addresses, using different network providers, and ignoring robots.txt files. A robots.txt file follows a standard protocol that tells automated systems which parts of a website they are permitted to access.
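To make the robots.txt mechanism concrete, here is a minimal sketch of how a well-behaved crawler could check a site's rules before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs and the `SomeOtherBot` name are hypothetical, and `PerplexityBot` is assumed here to be the user-agent token of a declared Perplexity crawler. This illustrates the protocol in general, not how Perplexity or Cloudflare implement anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site: block one declared AI crawler
# everywhere, and keep every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(may_fetch(parser, "PerplexityBot", "https://example.com/article"))
print(may_fetch(parser, "SomeOtherBot", "https://example.com/article"))
print(may_fetch(parser, "SomeOtherBot", "https://example.com/private/page"))
```

The protocol is advisory: nothing in it stops a client from skipping the check or presenting a different user-agent string, which is the behaviour Cloudflare alleges.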

Cloudflare created test websites with strict access restrictions and found that Perplexity could still provide detailed information about the content, despite the blocks being in place. The company states that when Perplexity’s declared crawlers (ones that indicate they originate from Perplexity) are blocked, the AI service disguises itself as a web browser and imitates Google Chrome on macOS to access restricted content.

“This undeclared crawler utilised multiple IPs not listed in Perplexity’s official IP range, and would rotate through these IPs in response to the restrictive robots.txt policy and block from Cloudflare,” said the company.
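The distinction between declared and undeclared traffic can be sketched as follows. This is an illustration under stated assumptions, not Cloudflare's detection logic: the `classify` function is hypothetical, the address range is a reserved documentation range standing in for a published IP list, and the user-agent strings are examples only.

```python
import ipaddress

# Assumed user-agent tokens for declared crawlers (illustrative).
DECLARED_TOKENS = ("PerplexityBot", "Perplexity-User")

# Placeholder for an officially published IP range. 192.0.2.0/24 is reserved
# for documentation and is NOT a real Perplexity range.
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def classify(user_agent: str, source_ip: str) -> str:
    """Label a request by what it claims to be and where it comes from."""
    ip = ipaddress.ip_address(source_ip)
    in_published_range = any(ip in network for network in PUBLISHED_RANGES)
    declares_itself = any(
        token.lower() in user_agent.lower() for token in DECLARED_TOKENS
    )
    if declares_itself and in_published_range:
        return "declared"      # identifies itself and the IP checks out
    if declares_itself:
        return "unverified"    # claims the identity from an unlisted IP
    return "undeclared"        # looks like any other client


# Example of a generic Chrome-on-macOS user-agent string (illustrative).
CHROME_MACOS_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

print(classify("Mozilla/5.0 (compatible; PerplexityBot/1.0)", "192.0.2.10"))
print(classify(CHROME_MACOS_UA, "198.51.100.7"))
```

A rule keyed only on the declared identity would never match the second request, which is why a crawler that presents itself as an ordinary browser from unlisted addresses is hard to block with user-agent or IP rules alone.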

The company reported observing 20-25 million daily requests from Perplexity’s declared crawlers and an additional 3-6 million from what it called “stealth” crawlers.

Cloudflare added that, in contrast to Perplexity, OpenAI follows best practices for crawling the web. “When we ran the same test as outlined above with ChatGPT, we found that ChatGPT-User fetched the robots file and stopped crawling when it was disallowed,” said the company.

Perplexity faced a similar allegation last year, when US-based publisher WIRED accused the AI startup of illicitly scraping its website.

Perplexity’s Counter

A few hours later, Perplexity responded on X, claiming that Cloudflare’s allegations reflect a fundamental misunderstanding of how AI assistants work. Cloudflare’s systems are inadequate for distinguishing between legitimate AI assistants and actual threats, the company said.

For example, Perplexity explained that when you ask a question requiring up-to-date information — such as ‘What are the latest reviews for that new restaurant?’ — it doesn’t have that data stored internally. Instead, it visits relevant websites, reads the content, and provides a customised summary.

“This is fundamentally different from traditional web crawling, in which crawlers systematically visit millions of pages to build massive databases, whether anyone asked for that specific information or not,” the company said, adding that user-driven agents like Perplexity only fetch content when a real person requests something specific, and use that to ‘immediately’ answer the user’s question.

Perplexity stated that when companies like Cloudflare mislabel user-driven AI assistants as malicious bots, they suggest that any automated service should be treated as suspicious, potentially criminalising email clients, web browsers, or any service a gatekeeper disapproves of.

“This controversy reveals that Cloudflare’s systems are fundamentally inadequate for distinguishing between legitimate AI assistants and actual threats,” added Perplexity.

The company described this as ‘overblocking’, which harms legitimate attempts at accessing information. “Consider someone using AI to research medical conditions, compare product reviews, or access news from multiple sources. If their assistant gets blocked as a ‘malicious bot’, they lose access to valuable information,” said Perplexity.

“Cloudflare’s recent blog post managed to get almost everything wrong about how modern AI assistants work,” said Perplexity, adding that Cloudflare had ‘confused’ 3-6 million daily requests of unrelated traffic from BrowserBase, a third-party cloud-browser service that Perplexity occasionally uses, with Perplexity’s own traffic.

Perplexity also described the investigation as ‘embarrassing’ and stated that Cloudflare’s leadership is ‘dangerously misinformed’ on the basics of AI. Notably, Perplexity is itself a customer of Cloudflare’s services.

AIM reached out to Cloudflare seeking a response to Perplexity’s comments.

“Rather than addressing their actions, Perplexity’s response attempts to deflect attention by broadening the discussion to all AI agents in ways that weren’t within the scope of our blog. Our point remains specific: content creators should have the right to control access to their content. We believe Perplexity’s admitted practices undermine this fundamental right,” said the company.

It said that when the Perplexity-User agent encountered a block, Cloudflare immediately observed follow-on requests from other user agents, which Perplexity admits belong to a third-party tool it uses. Cloudflare also said that Perplexity’s response essentially ‘confirms’ that it neither fetches nor respects robots.txt directives. The company added that it does not block Perplexity unless instructed to by a customer.

Divided Opinions

Opinions within the industry remain divided. Guillermo Rauch, the CEO of Vercel, opposed blocking AI crawlers on websites. In a post on X, he stated, “The internet is changing. The answer to AI is: more AI. Not to block and stagnate.”

He mentioned that tools like Perplexity have had an ‘extremely positive effect’ on Vercel’s business. He said that when developers ask these AI tools for platform recommendations, they often suggest Vercel, which boosts the platform’s growth and sign-ups.

Others pushed back on Rauch’s argument, suggesting that it misreads the situation. Matija Grcic, a developer, said on X that the issue is about Perplexity spoofing the user agent and scraping sites that it should not. “They [Cloudflare] aren’t blocking if the site owners allow the crawl and adhere to standards,” he said.

Moreover, is it fair to treat an AI agent, even if it is user-facing, the same way as a human? Ross Wightman, who works on computer vision at Hugging Face, pointed to this problem. “Agents or crawlers, I’m sure Cloudflare had users complaining about the access patterns. Until website operators/biz [businesses] have figured out how to capture value from agent use, many would probably prefer robots.txt to be respected for agents.”

The post The Cloudflare-Perplexity Clash Over Web Crawling appeared first on Analytics India Magazine.
