Study Finds A Third of New Websites are AI-Generated

Researchers working with data from the Internet Archive have discovered that a third of websites created since 2022 are AI-generated. The team, which includes researchers from Stanford, Imperial College London, and the Internet Archive, published their findings online in a paper titled “The Impact of AI-Generated Text on the Internet.” The research also found that all this AI-generated text is making the web more cheery and less verbose.

Inspired by the Dead Internet Theory—the idea that much of the internet is now just bots talking back and forth—the team set out to find out how ChatGPT and its competitors had reshaped the internet since 2022. “The proliferation of AI-generated and AI-assisted text on the internet is feared to contribute to a degradation in semantic and stylistic diversity, factual accuracy, and other negative developments,” the researchers write in the paper. “We find that by mid-2025, roughly 35% of newly published websites were classified as AI-generated or AI-assisted, up from zero before ChatGPT’s launch in late 2022.”

“I find the sheer speed of the AI takeover of the web quite staggering,” Jonáš Doležal, an AI researcher at Stanford and co-author of the paper, told 404 Media. “After decades of humans shaping it, a significant portion of the internet has become defined by AI in just three years. We’re witnessing, in my opinion, a major transformation of the digital landscape in a fraction of the time it took to build in the first place.”

The researchers also tested six common critiques of AI-generated text. Does it lead to a shrinking of viewpoints? Does it create more disinformation as hallucinations proliferate? Does online writing feel more sanitized and cheerful? Does it fail to cite its sources? Does it create strings of words with low semantic density? Has it forced writing into a monoculture where unique voices vanish and a generic, uniform style takes hold?

To answer these questions, the researchers partnered with the Internet Archive to pull samples of websites from the 33 months between August 2022 and May 2025. “For each sampled URL, we retrieve the oldest available archived snapshot via the Wayback Machine’s CDX Server API,” the researchers write. “The raw HTML of each snapshot is downloaded and stored locally for subsequent processing.”
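For readers curious what that retrieval step might look like, here is a minimal Python sketch. It is not the researchers' code. It assumes the public CDX endpoint at web.archive.org, relies on CDX results coming back oldest first so that `limit=1` yields the earliest capture, and uses the `id_` URL modifier to request the unmodified archived HTML. The function names are hypothetical, and the sketch only builds URLs and parses a response, leaving the actual download to the reader.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed public endpoint of the Wayback Machine CDX Server API.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def oldest_snapshot_query(url: str) -> str:
    """Build a CDX query asking for the single earliest capture of `url`."""
    params = {
        "url": url,
        "output": "json",
        "limit": 1,                  # results are sorted oldest first
        "filter": "statuscode:200",  # skip redirects and error captures
        "fl": "timestamp,original",  # only the fields needed below
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def raw_snapshot_url(cdx_json: str) -> Optional[str]:
    """Turn a CDX JSON response into a URL for the raw archived HTML.

    With output=json the first row is a header, so a useful response
    has at least two rows. Returns None when there is no capture.
    """
    rows = json.loads(cdx_json)
    if len(rows) < 2:
        return None
    timestamp, original = rows[1]
    # The "id_" suffix asks the Wayback Machine for the page as archived,
    # without its toolbar or rewritten links.
    return f"https://web.archive.org/web/{timestamp}id_/{original}"
```

Fetching the query URL with any HTTP client and passing the response body to `raw_snapshot_url` would give the address of the page to download and store.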

The researchers took the extracted website text and used the AI-detection software Pangram v3 to find AI-created websites. The team tested several AI-detection tools and found Pangram v3 had the highest detection rate. Once Pangram v3 had identified an AI-generated website, the researchers used that website as a sample to test their six hypotheses. “For each hypothesis, we define a measurable signal, compute it for each monthly sample of websites, and test whether it correlates with the aggregate AI likelihood score across months,” the researchers write.
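The month-by-month test the paper describes can be illustrated with a short Python sketch. The article does not say which correlation statistic the team used, so Pearson's r here is an assumption, as are the function names and the "YYYY-MM" month keys. Any numbers fed to it in this form would be per-site values grouped by month.

```python
from math import sqrt


def monthly_means(values_by_month: dict) -> list:
    """Average per-site values within each month, in chronological order.

    Keys are assumed to be "YYYY-MM" strings, so sorting them
    alphabetically also sorts them by date.
    """
    return [sum(v) / len(v) for _, v in sorted(values_by_month.items())]


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def signal_vs_ai_score(signal_by_month: dict, ai_score_by_month: dict) -> float:
    """Correlate a hypothesis signal with the aggregate AI likelihood score."""
    if sorted(signal_by_month) != sorted(ai_score_by_month):
        raise ValueError("both inputs must cover the same months")
    return pearson(monthly_means(signal_by_month),
                   monthly_means(ai_score_by_month))
```

A value near 1 would mean the signal rises in step with the share of AI-generated text, a value near -1 that it falls, and a value near 0 that the two do not move together. A correlation across months does not by itself show that AI text caused the change.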

To test if AI was creating an internet full of falsehoods, for example, the team extracted fact-based claims from the websites they’d selected and then paid human fact-checkers to verify them. To figure out if AI is citing its sources, the team computed the outbound link density in AI-generated text.
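The article does not spell out how outbound link density was defined. One plausible reading, sketched below in Python, is the number of links pointing to a different host divided by the number of words of visible text. That definition, the class and function names, and the exact-match host comparison are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _LinkAndWordCounter(HTMLParser):
    """Counts links to other hosts and words of visible text in a page."""

    def __init__(self, site_host: str):
        super().__init__()
        self.site_host = site_host.lower()
        self.outbound_links = 0
        self.words = 0
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            host = urlparse(href).netloc.lower()
            # Relative links have no host and count as internal.
            if host and host != self.site_host:
                self.outbound_links += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words += len(data.split())


def outbound_link_density(html: str, site_host: str) -> float:
    """Outbound links per word of visible text (0.0 for an empty page)."""
    counter = _LinkAndWordCounter(site_host)
    counter.feed(html)
    counter.close()
    return counter.outbound_links / counter.words if counter.words else 0.0
```

Because hosts are compared exactly, a link from example.com to www.example.com would count as outbound in this sketch. A real pipeline would likely compare registered domains instead.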

To the surprise of the researchers, only two of the six theories they tested about the effects of AI-generated text seemed true. AI was making the internet less semantically diverse and more positive overall, but it wasn’t causing a proliferation of lies or cutting out its sources.

“The most surprising result was that our Truth Decay hypothesis wasn’t confirmed,” Doležal said. “It’s worth noting that we were specifically looking for an increase in verifiably untrue statements, which we didn’t find. But it could still be the case that AI is quietly increasing the volume of unverifiable claims, ones that can’t be checked against existing fact-checking tools and infrastructure. Or it may simply be that the internet wasn’t a particularly truth-adhering place to begin with.”

The researchers said they’d continue to study how AI-generated text shaped the internet. “We’re now working with the Internet Archive to turn this into a continuous tool that keeps providing this signal going forward, rather than a single fixed snapshot bounded by the static nature of a paper,” Maty Bohacek, a student researcher at Stanford and one of the co-authors of the paper, told 404 Media. “We’re also interested in adding more granularity: looking at which kinds of websites are most affected, broken down by category or language, and generally providing more nuance about where these impacts are landing.”

For Doležal, studies like this are critical for ensuring a useful and productive internet. “As AI-generated content spreads, the challenge is finding a role for these models that doesn’t just result in a sanitized, repetitive web,” he said. “Rather than forcing models to be perfectly compliant and agreeable, allowing them to have a more distinct personality or ‘friction’ might help them act as a creative partner rather than a replacement for human voice.”
