Central China Normal University
Abstract
The automated generation of research workflows is essential for improving the
reproducibility of research and advancing the "AI for Science" paradigm.
However, existing methods typically extract only fragmented procedural
components and thus fail to capture complete research workflows. To address
this gap, we propose an end-to-end framework that generates comprehensive,
structured research workflows by mining full-text academic papers. As a case
study in the Natural Language Processing (NLP) domain, our paragraph-centric
approach first employs Positive-Unlabeled (PU) Learning with SciBERT to
identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772.
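As a rough illustration of this classification step, the sketch below pairs
frozen SciBERT embeddings with the classic Elkan-Noto (2008) PU estimator;
the paper's actual PU variant, training regime, and data are not given in the
abstract, so everything in the snippet is an assumption.

```python
# Minimal PU-learning sketch: frozen SciBERT embeddings + Elkan-Noto (2008)
# probability rescaling. Toy data and hyperparameters are illustrative only.
# Requires: pip install torch transformers scikit-learn numpy
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def embed(paragraphs):
    """Mean-pooled SciBERT embeddings for a list of paragraphs."""
    with torch.no_grad():
        batch = tokenizer(paragraphs, padding=True, truncation=True,
                          max_length=256, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state        # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy corpus: labeled positives (workflow-descriptive) plus unlabeled text.
positives = ["We first tokenize the corpus and remove stop words.",
             "We then fine-tune the model and evaluate on the held-out split."]
unlabeled = ["Prior work has studied citation networks extensively.",
             "We split the data into train and test sets before training.",
             "The workshop was co-located with the main conference."]

X = np.vstack([embed(positives), embed(unlabeled)])
s = np.array([1] * len(positives) + [0] * len(unlabeled))  # s=1: has a label

# Step 1: train a positive-vs-unlabeled classifier g(x) ~ P(s=1 | x).
X_tr, X_hold, s_tr, s_hold = train_test_split(X, s, test_size=0.4,
                                              stratify=s, random_state=0)
g = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)

# Step 2: estimate c = E[g(x) | x positive] on held-out positives, then
# rescale so that P(y=1 | x) is approximated by g(x) / c.
c = g.predict_proba(X_hold[s_hold == 1])[:, 1].mean()
p_true = np.clip(g.predict_proba(embed(unlabeled))[:, 1] / max(c, 1e-6), 0, 1)
for text, p in zip(unlabeled, p_true):
    print(f"{p:.2f}  {text}")
```

In practice one would fine-tune SciBERT end-to-end on far more paragraphs; the
rescaling trick is what lets a standard classifier be trained when only
positive labels are reliable.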
Subsequently, we utilize Flan-T5 with prompt learning to generate workflow
phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of
0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically
categorized into data preparation, data processing, and data analysis stages
using ChatGPT with few-shot learning, achieving a classification precision of
0.958. By mapping the categorized phrases back to their locations in the
source documents, we finally generate readable visual flowcharts of entire
research workflows. This approach facilitates the analysis of workflows derived
from an NLP corpus and reveals key methodological shifts over the past two
decades, including the increasing emphasis on data analysis and the transition
from feature engineering to ablation studies. Our work offers a validated
technical framework for automated workflow generation, along with a novel,
process-oriented perspective for the empirical investigation of evolving
scientific paradigms. Source code and data are available at:
https://github.com/ZH-heng/research_workflow.
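To make the phrase-generation step concrete, here is a minimal sketch of
prompting Flan-T5 via Hugging Face transformers; the checkpoint size, prompt
template, and decoding settings are assumptions rather than the paper's exact
configuration.

```python
# Minimal sketch of prompt-based workflow-phrase generation with Flan-T5.
# Requires: pip install torch transformers
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

paragraph = ("We first clean the corpus by removing duplicates, then train a "
             "CRF tagger on the annotated sentences and evaluate it with "
             "five-fold cross-validation.")

# Hypothetical prompt template; the paper's actual template may differ.
prompt = ("Extract short phrases describing the research workflow steps in "
          f"the following paragraph, separated by semicolons:\n\n{paragraph}")

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```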
AI Insights
- PU learning, proven in spam and fake-review detection, isolates workflow-descriptive paragraphs from full-text papers.
- Flan-T5 prompt learning converts these paragraphs into workflow phrases, a technique transferable to other generative tasks.
- Few-shot ChatGPT classification reaches 0.958 precision, showcasing large-model prompting for structured knowledge extraction (see the sketch after this list).
- The open-source GitHub repo invites extensions to domains beyond NLP, fostering community-driven workflow mining.
- The analysis reveals a two-decade shift from feature engineering to ablation studies, highlighting a growing focus on data analysis.
- Recommended reading: "Speech and Language Processing" by Jurafsky & Martin and Devlin et al.'s BERT paper for foundational context.
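The few-shot stage classification mentioned above might look like the sketch
below, written against the OpenAI Python SDK; the model name, few-shot
examples, and prompt wording are assumptions, not the paper's exact setup.

```python
# Few-shot classification of workflow phrases into three stages via the
# OpenAI chat API. Model choice and examples are illustrative assumptions.
# Requires: pip install openai, with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = [
    ("collect tweets via the Twitter API", "data preparation"),
    ("remove stop words and lowercase the text", "data processing"),
    ("conduct an ablation study on model components", "data analysis"),
]

def classify(phrase: str) -> str:
    """Assign a workflow phrase to one of the three workflow stages."""
    messages = [{"role": "system",
                 "content": "Classify each research-workflow phrase as one "
                            "of: data preparation, data processing, or data "
                            "analysis. Answer with the stage name only."}]
    # Few-shot examples are supplied as prior conversation turns.
    for example, stage in FEW_SHOT:
        messages.append({"role": "user", "content": example})
        messages.append({"role": "assistant", "content": stage})
    messages.append({"role": "user", "content": phrase})
    resp = client.chat.completions.create(model="gpt-3.5-turbo",
                                          messages=messages, temperature=0)
    return resp.choices[0].message.content.strip()

print(classify("fine-tune BERT on the labeled training set"))
```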
University of Copenhagen
Abstract
Peer review remains the central quality-control mechanism of science, yet its
ability to fulfill this role is increasingly strained. Empirical studies
document serious shortcomings: long publication delays, escalating reviewer
burden concentrated on a small minority of scholars, inconsistent quality and
low inter-reviewer agreement, and systematic biases by gender, language, and
institutional prestige. Decades of human-centered reforms have yielded only
marginal improvements. Meanwhile, artificial intelligence, especially large
language models (LLMs), is being piloted across the peer-review pipeline by
journals, funders, and individual reviewers. Early studies suggest that AI
assistance can produce reviews comparable in quality to human-written ones,
accelerate
reviewer selection and feedback, and reduce certain biases, but also raise
distinctive concerns about hallucination, confidentiality, gaming, novelty
recognition, and loss of trust. In this paper, we map the aims and persistent
failure modes of peer review to specific LLM applications and systematically
analyze the objections they raise alongside safeguards that could make their
use acceptable. Drawing on emerging evidence, we show that targeted, supervised
LLM assistance can plausibly improve error detection, timeliness, and reviewer
workload without displacing human judgment. We highlight advanced
architectures, including fine-tuned, retrieval-augmented, and multi-agent
systems, that may enable more reliable, auditable, and interdisciplinary
review. We argue that ethical and practical considerations are not peripheral
but constitutive: the legitimacy of AI-assisted peer review depends on
governance choices as much as technical capacity. The path forward is neither
uncritical adoption nor reflexive rejection, but carefully scoped pilots with
explicit evaluation metrics, transparency, and accountability.
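As a purely illustrative sketch of the retrieval-augmented, multi-agent
direction mentioned above, the skeleton below shows how specialized agents
and a retriever could feed an aggregated report that a human editor signs off
on; every function and name here is hypothetical, not a description of any
deployed system.

```python
# Hypothetical skeleton of a retrieval-augmented, multi-agent review
# assistant. No real LLM or retrieval backend is wired in; each piece is a
# placeholder marking where a model call or index query would go.
from dataclasses import dataclass

@dataclass
class ReviewNote:
    agent: str      # which specialized agent produced the note
    finding: str    # a single auditable observation

def retrieve_related_work(manuscript: str) -> list[str]:
    """Placeholder retriever; a real system would query a citation index."""
    return ["[1] Closest prior work ...", "[2] Overlapping method ..."]

def methods_agent(manuscript: str) -> ReviewNote:
    """Placeholder for an LLM pass focused on methodological soundness."""
    return ReviewNote("methods", "Flag statistics reported without "
                                 "uncertainty estimates.")

def novelty_agent(manuscript: str, related: list[str]) -> ReviewNote:
    """Placeholder for an LLM pass that checks claims against retrieval."""
    return ReviewNote("novelty", f"Compare claims against {len(related)} "
                                 "retrieved papers before judging novelty.")

def aggregate(notes: list[ReviewNote]) -> str:
    """Aggregated notes go to a human editor, not straight to a decision."""
    return "\n".join(f"[{n.agent}] {n.finding}" for n in notes)

manuscript = "..."  # full text of the submission (elided)
related = retrieve_related_work(manuscript)
notes = [methods_agent(manuscript), novelty_agent(manuscript, related)]
print(aggregate(notes))
```

The point of the skeleton is the division of labor: retrieval grounds novelty
judgments, each agent produces auditable notes, and a human retains the final
decision, matching the supervised, accountable use the abstract argues for.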