OpenAI’s New Benchmark to Study AI Agents’ Research Capabilities

OpenAI has unveiled PaperBench, a new benchmark that measures how well AI agents can reproduce cutting-edge AI research. The test checks whether an AI can understand a research paper, write the code, and execute it to match the paper’s results.

PaperBench uses 20 top papers from the International Conference on Machine Learning (ICML) 2024, covering 12 different topics. Across these papers, the benchmark contains 8,316 individually gradable tasks. For objective evaluation, rubrics were developed that decompose each replication task hierarchically into smaller subtasks with clear grading criteria. These rubrics were co-developed with the authors of each ICML paper for accuracy and realism.
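To illustrate how a hierarchical rubric can turn many small pass/fail checks into a single score, here is a minimal Python sketch. It assumes each leaf criterion is graded pass or fail and each parent takes the weighted average of its children. The class name, field names, weights, and example criteria are hypothetical and are not taken from the PaperBench code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node in a hypothetical hierarchical rubric."""
    name: str
    weight: float = 1.0
    passed: Optional[bool] = None  # only meaningful on leaf nodes
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Return a score in [0, 1]: leaves are pass/fail, parents are weighted averages."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight == 0:
        return 0.0
    return sum(child.weight * score(child) for child in node.children) / total_weight


# Made-up rubric for illustration only.
rubric = RubricNode("reproduce paper", children=[
    RubricNode("implement method", weight=2.0, children=[
        RubricNode("model defined", passed=True),
        RubricNode("training loop defined", passed=False),
    ]),
    RubricNode("results match", weight=1.0, passed=False),
])

print(f"replication score: {score(rubric):.1%}")
```

In this made-up rubric, the agent gets partial credit for implementing half of the method even though the final results do not match, which is the kind of graded signal a hierarchical breakdown allows.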

The AI has to extract the details from the paper and submit all the code required to reproduce it in a repository. The benchmark also requires the AI to create a ‘reproduce.sh’ script that executes the code and, if the attempt succeeds, reproduces the paper’s results.

The submissions are evaluated by an AI judge, which OpenAI claims performs close to a human judge. “Our best LLM-based judge, which uses o3-mini-high with custom scaffolding, achieves an F1 score of 0.83 on the auxiliary evaluation, suggesting that this judge is a reasonable stand-in for a human judge,” the research paper stated.
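For readers unfamiliar with the metric, an F1 score summarises how well one set of pass/fail decisions agrees with a reference set by combining precision and recall. The sketch below computes a plain binary F1 in Python, treating human grades as the reference. The function name and the sample grades are invented for illustration, and the exact averaging used in the paper's evaluation may differ.

```python
from typing import Sequence


def f1_score(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Binary F1 of the judge's pass/fail grades against human reference grades."""
    pairs = list(zip(human, judge))
    true_pos = sum(1 for h, j in pairs if h and j)
    false_pos = sum(1 for h, j in pairs if not h and j)
    false_neg = sum(1 for h, j in pairs if h and not j)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Invented grades for five rubric items (True = criterion met).
human_grades = [True, True, False, True, False]
judge_grades = [True, False, False, True, True]

print(f"F1: {f1_score(human_grades, judge_grades):.2f}")
```

A score of 1.0 would mean the judge's grades match the human grades on every item that either side marked as passed; lower values indicate more disagreement.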

Several AI models were tested on PaperBench. The best performing model was Anthropic’s Claude 3.5 Sonnet, which achieved a 21.0% replication score. Other models, including OpenAI’s o1, GPT-4o, Gemini 2.0 Flash, and DeepSeek-R1, scored lower.

In comparison, human PhDs in machine learning scored 41.4% on average, suggesting that current AI is far from human expertise.

A separate test was also conducted with OpenAI’s o1 running for an extended duration, and it still failed to match the human attempt.

PaperBench’s code is available to the public on GitHub. A lightweight version of the benchmark, PaperBench Code-Dev, is also available for more people to use.

The post OpenAI’s New Benchmark to Study AI Agents’ Research Capabilities appeared first on Analytics India Magazine.
