Carnegie Mellon University
Abstract
AI scientist systems, capable of autonomously executing the full research
workflow from hypothesis generation and experimentation to paper writing, hold
significant potential for accelerating scientific discovery. However, the
internal workflows of these systems have not been closely examined. This lack of
scrutiny poses a risk of introducing flaws that could undermine the integrity,
reliability, and trustworthiness of their research outputs. In this paper, we
identify four potential failure modes in contemporary AI scientist systems:
inappropriate benchmark selection, data leakage, metric misuse, and post-hoc
selection bias. To examine these risks, we design controlled experiments that
isolate each failure mode while addressing challenges unique to evaluating AI
scientist systems. Our assessment of two prominent open-source AI scientist
systems reveals several failures, spanning a spectrum of severity, that can
easily be overlooked in practice. Finally, we demonstrate
that access to trace logs and code from the full automated workflow enables far
more effective detection of such failures than examining the final paper alone.
We thus recommend that journals and conferences evaluating AI-generated
research mandate submission of these artifacts alongside the paper to ensure
transparency, accountability, and reproducibility.
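
As a concrete illustration of one of these failure modes, the sketch below shows data leakage in a toy training pipeline. It is a minimal, hypothetical example and is not drawn from the paper's experiments.

    # Minimal, hypothetical illustration of the data-leakage failure mode
    # (not taken from the paper's experiments): the scaler is fit on the full
    # dataset before the train/test split, so test-set statistics leak into
    # training. The accuracy gap may be small in this toy setting; the point
    # is the workflow error itself.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

    # Leaky pipeline: preprocessing sees the test rows before the split.
    X_scaled = StandardScaler().fit_transform(X)
    X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
    leaky_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

    # Correct pipeline: the scaler is fit on the training split only.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    scaler = StandardScaler().fit(X_tr)
    clean_acc = (LogisticRegression()
                 .fit(scaler.transform(X_tr), y_tr)
                 .score(scaler.transform(X_te), y_te))
    print(f"leaky accuracy:   {leaky_acc:.3f}")
    print(f"correct accuracy: {clean_acc:.3f}")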
Stanford University
Abstract
We introduce Paper2Agent, an automated framework that converts research
papers into AI agents. Paper2Agent transforms research output from passive
artifacts into active systems that can accelerate downstream use, adoption, and
discovery. Conventional research papers require readers to invest substantial
effort to understand and adapt a paper's code, data, and methods to their own
work, creating barriers to dissemination and reuse. Paper2Agent addresses this
challenge by automatically converting a paper into an AI agent that acts as a
knowledgeable research assistant. It systematically analyzes the paper and the
associated codebase using multiple agents to construct a Model Context Protocol
(MCP) server, then iteratively generates and runs tests to refine and robustify
the resulting MCP. These paper MCPs can then be flexibly connected to a chat
agent (e.g. Claude Code) to carry out complex scientific queries through
natural language while invoking tools and workflows from the original paper. We
demonstrate Paper2Agent's effectiveness in creating reliable and capable paper
agents through in-depth case studies. Paper2Agent created an agent that
leverages AlphaGenome to interpret genomic variants and agents based on ScanPy
and TISSUE to carry out single-cell and spatial transcriptomics analyses. We
validate that these paper agents can reproduce the original paper's results and
can correctly carry out novel user queries. By turning static papers into
dynamic, interactive AI agents, Paper2Agent introduces a new paradigm for
knowledge dissemination and a foundation for the collaborative ecosystem of AI
co-scientists.
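
To make the notion of a paper MCP concrete, the sketch below shows a minimal MCP server exposing a single tool via the MCP Python SDK's FastMCP class. The server name, tool, and return value are hypothetical placeholders, not the servers Paper2Agent actually generates.

    # Hypothetical sketch of a "paper MCP" server, written against the MCP
    # Python SDK (FastMCP). The tool below is a placeholder standing in for a
    # workflow extracted from a paper's codebase; the servers Paper2Agent
    # generates will differ in structure and content.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("example-paper-mcp")

    @mcp.tool()
    def score_variant(chrom: str, pos: int, ref: str, alt: str) -> dict:
        """Placeholder variant-scoring tool; a real paper MCP would invoke
        the paper's own code (e.g. an AlphaGenome prediction routine) here."""
        return {"chrom": chrom, "pos": pos, "ref": ref, "alt": alt, "score": 0.0}

    if __name__ == "__main__":
        # A chat agent such as Claude Code can connect over stdio and call
        # score_variant in response to natural-language requests.
        mcp.run(transport="stdio")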
AI Insights
- Paper2Agent’s six‑step pipeline: locate code, set up environment, discover tutorials, audit execution, extract tools, assemble MCP server (a simplified sketch follows this list).
- The orchestrator agent coordinates sub‑agents, ensuring reliable execution of complex workflows across the MCP ecosystem.
- An MCP server exposes a paper’s tools via a standardized API, enabling reproducible, production‑ready analysis.
- Agents built for AlphaGenome, TISSUE, and Scanpy reproduced original results and answered novel queries.
- Generating agents demands high computational resources for large datasets, a noted limitation.
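
A heavily simplified, hypothetical skeleton of the six-step pipeline summarized above; all function names and return values are illustrative stubs, not Paper2Agent's actual interface.

    # Hypothetical skeleton of the six-step pipeline summarized above; every
    # function is an illustrative stub, not Paper2Agent's real interface.
    def locate_code(paper_url: str) -> str:
        return "/tmp/paper_repo"                 # 1. find the paper's codebase

    def set_up_environment(repo: str) -> None:
        pass                                     # 2. install its dependencies

    def discover_tutorials(repo: str) -> list[str]:
        return ["tutorials/quickstart.ipynb"]    # 3. find runnable examples

    def audit_execution(tutorials: list[str]) -> list[str]:
        return tutorials                         # 4. execute and verify them

    def extract_tools(tutorials: list[str]) -> list[str]:
        return ["score_variant"]                 # 5. wrap key steps as tools

    def assemble_mcp_server(tools: list[str]) -> str:
        return f"MCP server exposing {tools}"    # 6. expose tools via MCP

    def build_paper_mcp(paper_url: str) -> str:
        repo = locate_code(paper_url)
        set_up_environment(repo)
        tools = extract_tools(audit_execution(discover_tutorials(repo)))
        return assemble_mcp_server(tools)

    print(build_paper_mcp("https://example.org/paper"))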