The release aims to streamline and democratise AI data curation.
US-based AI startup Essential AI has released 'Essential-Web v1.0', a new 24-trillion-token dataset. It comprises 23.6 billion documents, each annotated with a 12-category taxonomy covering subject matter, page type, content complexity, and quality.
“Practitioners can now rapidly and inexpensively curate new datasets by writing SQL-like filters that utilise these metadata columns,” said Essential AI.
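In practice, "SQL-like filters over metadata columns" means a curated subset becomes a single query rather than a bespoke processing pipeline. The sketch below illustrates the idea using Python's built-in `sqlite3` as a stand-in query engine; the column names (`subject`, `page_type`, `reasoning_depth`, `quality_score`) are hypothetical illustrations, not Essential-Web's actual schema.

```python
import sqlite3

# In-memory table standing in for Essential-Web's per-document metadata.
# Column names are illustrative only, not the dataset's real taxonomy fields.
con = sqlite3.connect(":memory:")
con.execute(
    """CREATE TABLE documents (
        doc_id INTEGER PRIMARY KEY,
        subject TEXT,             -- e.g. 'medicine', 'software_dev'
        page_type TEXT,           -- e.g. 'tutorial', 'forum', 'news'
        reasoning_depth INTEGER,  -- higher = more complex content
        quality_score REAL        -- 0.0 (low) to 1.0 (high)
    )"""
)
con.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?, ?)",
    [
        (1, "medicine", "tutorial", 3, 0.92),
        (2, "software_dev", "forum", 1, 0.41),
        (3, "medicine", "news", 2, 0.77),
        (4, "software_dev", "tutorial", 4, 0.88),
    ],
)

# Curating a high-quality, in-depth medical subset is now just a SELECT.
rows = con.execute(
    """SELECT doc_id FROM documents
       WHERE subject = 'medicine'
         AND reasoning_depth >= 2
         AND quality_score > 0.7
       ORDER BY doc_id"""
).fetchall()
print([r[0] for r in rows])  # → [1, 3]
```

The point of the design is that filtering, not reprocessing, becomes the unit of curation: changing the target domain means changing the `WHERE` clause, not rebuilding a pipeline.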
The company claims that datasets curated using the ESSENTIAL-WEB V1.0 taxonomy outperform existing datasets across various domains. "Our math dataset performs within 8.0% of SOTA, and our web code, STEM, and medical datasets outperform SOTA by 14.3%, 24.5%, and 8.6%, respectively," it said.
The 12-category taxonomy was used to train a classifier, EAI-Distill-0.5b, which labels documents efficiently. It can process billions of web documents with minimal manual intervention, significantly reducing the cost and complexity of building domain-specific datasets.
Essential AI fine-tuned Alibaba's Qwen2.5-0.5b-instruct model to perform the taxonomy classification task. The resulting classifier, EAI-Distill-0.5b, runs roughly 50 times faster than prompting the parent model while maintaining comparable performance.
“Structured web data transforms corpus curation from [a] complex, expensive processing pipeline into a search problem that anyone can solve,” said the company.
“We hope ESSENTIAL-WEB V1.0 becomes a community commons: a foundation others can refine, audit, or curate in new ways, accelerating open research on LLM training data, arguably the most valuable, yet least shared, asset contributing to modern LLM capabilities.”
Ashish Vaswani, the startup’s co-founder and CEO, was one of the authors of Google’s ‘Attention is All You Need’ paper, which was released in 2017 and introduced the Transformer architecture for AI models.
Essential AI has also published a detailed technical report on the dataset.
The post Essential AI Releases 24-Trillion Pre-Training Data Set appeared first on Analytics India Magazine.