Perplexity Just Got Caught Breaking the Rules Red-Handed

Over two decades ago, the New Oxford American Dictionary wanted to see if any of its competitors were cribbing its definitions. So it set up a trap. In its first edition, published in 2001, NOAD included a word called “esquivalience,” which it defined as the “willful avoidance of one’s official responsibilities.”

The word was a fake. And the bait worked: the reference website Dictionary.com was caught listing “esquivalience,” attributing it to Webster’s New Millennium Dictionary of English. Its guilt was undeniable, and the debacle gained considerable media coverage.

These copyright traps have a name: “mountweazels” — a term with its own curious history — and an evolution of them is now being used by companies fending off AI data scrapers that devour vast swathes of the internet without asking permission.

In a lawsuit against four tech companies filed Wednesday and covered by The New York Times, Reddit revealed how it managed to ensnare the AI startup Perplexity with its own sort of mountweazel. The forum-based social media platform put up a “test post” on its site that could “only be crawled by Google’s search engine and was not otherwise accessible anywhere on the internet,” the company said in the suit.

But within hours, Perplexity’s AI-powered search engine showed the content from the trap Reddit post.

“Perplexity’s business model is effectively to take Reddit’s content from Google search results,” then feed it into an AI model and “call it a new product,” Reddit lawyers argued in the suit, per the NYT.

It’s the latest lawsuit to put the AI industry’s voracious use of scraped data under the spotlight. Training the large language models behind AI products like ChatGPT would not have been possible without free access to an enormous wealth of data, much of it copyrighted. Reddit itself is trying to cash in on the demand for AI training data by locking out scrapers and selling its user data at a premium. It expects to make more than $200 million in the next few years through the data licensing venture.

In addition to Perplexity, the Reddit suit targets three more data scraping firms: SerpApi based in Texas; Oxylabs, a Lithuanian startup; and AWMProxy in Russia, which has been linked to a notorious malware botnet called Glupteba.

Years before the AI boom, these companies scraped mountains of Google search data to provide search engine optimization services to businesses. Google’s search results were themselves created by scraping websites and then organizing that data. For the most part, this created a mutually beneficial relationship, since scraping helped direct traffic to the websites the data came from through search results, the NYT explains.

But then these SEO firms started selling their troves of scraped Google data directly to AI companies. The AI chatbots trained on these data sets don’t direct a meaningful amount of traffic to the websites they get their data from, if they give accurate attributions at all, and suddenly the relationship became one-sided.

Reddit, which is experimenting with its own built-in AI, says that Perplexity bought these firms’ scraped data sets, circumventing a cease-and-desist letter Reddit sent after it caught Perplexity scraping data directly from its posts without paying for it. The lawsuit noted that citations to Reddit data in Perplexity’s AI search results had jumped “fortyfold,” per the NYT.


The post Perplexity Just Got Caught Breaking the Rules Red-Handed appeared first on Futurism.
