Someone Made a Dataset of One Million Bluesky Posts for ‘Machine Learning Research’

A machine learning librarian at Hugging Face just released a dataset composed of one million Bluesky posts, complete with when they were posted and who posted them, intended for machine learning research.

Daniel van Strien posted about the dataset on Bluesky on Tuesday:

First dataset for the new @huggingface.bsky.social @bsky.app community organisation: one-million-bluesky-posts 🦋

📊 1M public posts from Bluesky’s firehose API
🔍 Includes text, metadata, and language predictions
🔬 Perfect to experiment with using ML for Bluesky 🤗

huggingface.co/datasets/blu…

Daniel van Strien (@danielvanstrien.bsky.social) 2024-11-26T13:50:34.824Z

“This dataset contains 1 million public posts collected from Bluesky Social’s firehose API, intended for machine learning research and experimentation with social media data,” the dataset description says. “Each post contains text content, metadata, and information about media attachments and reply relationships.” 
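Records with those described fields can be filtered and aggregated with a few lines of Python. This is a minimal sketch only: the field names below are assumptions based on the description, not the dataset's exact schema, and in practice you would load the real data from the Hugging Face Hub with the `datasets` library rather than hand-writing records.

```python
from collections import Counter

# Illustrative records mirroring the described fields (text, author DID,
# language prediction, reply relationship). Field names are assumptions.
posts = [
    {"text": "The cat is gay", "author": "did:plc:aaa", "langs": ["en"], "reply_to": None},
    {"text": "Hola Bluesky!", "author": "did:plc:bbb", "langs": ["es"], "reply_to": None},
    {"text": "Same here", "author": "did:plc:ccc", "langs": ["en"], "reply_to": "at://..."},
]

# Count posts per predicted language, skipping replies.
lang_counts = Counter(
    lang
    for post in posts
    if post["reply_to"] is None
    for lang in post["langs"]
)
print(lang_counts)  # Counter({'en': 1, 'es': 1})
```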

A screenshot of the dataset on Hugging Face

The data isn’t anonymous. Each post in the dataset is listed alongside its author’s decentralized identifier, or DID; van Strien also made a search tool for finding users based on their DID and published it on Hugging Face. A quick skim through the first few hundred of the million posts shows people doing normal types of Bluesky posting—arguing about politics, talking about concerts, saying stuff like “The cat is gay” and “When’s the last time yall had Boston baked beans?”—but the dataset has also swept up a lot of adult content. 
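A DID is a structured identifier string; Bluesky accounts commonly use the `plc` method (e.g. `did:plc:ewvi7nxzyoun6zhxrhs64oiz`). The hypothetical helper below illustrates that structure only; it is not part of any official library, and real resolution of a DID to a handle goes through the AT Protocol's directory services.

```python
# Hypothetical parser showing the shape of a DID: "did:<method>:<identifier>".
def parse_did(did: str) -> dict:
    parts = did.split(":", 2)  # method-specific id may itself contain colons
    if len(parts) != 3 or parts[0] != "did":
        raise ValueError(f"not a valid DID: {did!r}")
    return {"scheme": parts[0], "method": parts[1], "identifier": parts[2]}

print(parse_did("did:plc:ewvi7nxzyoun6zhxrhs64oiz"))
```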

It’s also noteworthy that the dataset is a snapshot in time of Bluesky, meaning it could, and probably does, include since-deleted posts.

This dataset could be used for “training and testing language models on social media content, analyzing social media posting patterns, studying conversation structures and reply networks, research on social media content moderation, [and] natural language processing tasks using social media data,” the project page says. “Out of scope use” includes “building automated posting systems for Bluesky, creating fake or impersonated content, extracting personal information about users, [and] any purpose that violates Bluesky’s Terms of Service.” 

The dataset is already popular: as of writing, it’s one of the top trending Hugging Face projects.

The firehose API van Strien references in the dataset description is part of what makes Bluesky unique among social media platforms. It’s an aggregated, chronological stream of all the public data updates as they happen in the network, including posts, likes, follows, handle changes, and more, according to Bluesky. It’s public, and the platform is built on the open AT Protocol, so anything that runs through the firehose—which, again, is everything that happens on Bluesky—is technically available to independent developers. People have pulled from the firehose to build monitoring tools like Firesky, visualizers, bots, and other services.
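At the protocol level, the firehose is a websocket stream exposed via the AT Protocol's `com.atproto.sync.subscribeRepos` endpoint. The sketch below assumes the third-party `websockets` package is installed (`pip install websockets`); events arrive as binary DAG-CBOR-framed messages, which real consumers decode with an AT Protocol library rather than by hand.

```python
import asyncio

# Public relay endpoint for the AT Protocol repo-event firehose.
FIREHOSE_URL = "wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos"

async def tail_firehose(limit: int = 5) -> None:
    """Receive a few raw event frames from the firehose and report their sizes."""
    import websockets  # third-party package; assumed installed

    async with websockets.connect(FIREHOSE_URL) as ws:
        for _ in range(limit):
            frame = await ws.recv()  # raw binary DAG-CBOR frame
            print(f"received {len(frame)} bytes")

# To stream a few frames: asyncio.run(tail_firehose())
```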

Since this is all public, there’s nothing stopping anyone from making datasets out of Bluesky user data to train AI models. But Bluesky as a platform has promised it won’t use that content to train generative AI itself.

Earlier this month, the official Bluesky account posted its stance on user data and AI: “A number of artists and creators have made their home on Bluesky, and we hear their concerns with other platforms training on their data. We do not use any of your content to train generative AI, and have no intention of doing so,” it said. “Bluesky uses AI internally to assist in content moderation, which helps us triage posts and shield human moderators from harmful content. We also use AI in the Discover algorithmic feed to serve you posts that we think you’d like. None of these are Gen AI systems trained on user content.”

In response to a request for comment about van Strien’s dataset, Bluesky spokesperson Emily Liu sent 404 Media the same statement shared with the Verge about users’ posts as training data: “Bluesky is an open and public social network, much like websites on the Internet itself. Just as robots.txt files don’t always prevent outside companies from crawling those sites, the same applies here. We’d like to find a way for Bluesky users to communicate to outside orgs/developers whether they consent to this and that outside orgs respect user consent, and we’re actively discussing how to achieve this.”

By comparison, X recently added a clause to its terms of service, in the “Your Rights and Grant of Rights in the Content” section, that says by posting to the site you are granting “worldwide, non-exclusive, royalty-free license (with the right to sublicense) [and] you agree that this license includes the right for us to (i) analyze text and other information you provide and to otherwise provide, promote, and improve the Services, including […] for use with and training of our machine learning and artificial intelligence models, whether generative or another type.” Meta trains its generative AI on users’ data, too. 

A lot of people have left old platforms and moved to Bluesky in large part out of protest against having their content and conversations used as AI fodder, and because the decentralized model of social media gives users more ownership and control over their own content. But what makes Bluesky appealing—the open-source, decentralized aspects of its infrastructure—also makes it vulnerable to anyone who wants to do whatever they want with that data, without needing anyone’s permission. 
