Brookhaven National Lab
Abstract
The collaborative efforts of large communities in science experiments, often
comprising thousands of global members, reflect a monumental commitment to
exploration and discovery. Recently, advanced and complex data processing has
gained increasing importance in science experiments. Data processing workflows
typically consist of multiple intricate steps, and precisely specifying each
step's resource requirements is crucial for allocating optimal resources for
effective processing. Estimating resource requirements in advance is
challenging due to a wide range of analysis scenarios, varying skill levels
among community members, and the continuously increasing spectrum of computing
options. One practical approach to mitigate these challenges involves initially
processing a subset of each step to measure precise resource utilization from
actual processing profiles before completing the entire step. While this
two-stage approach lets most of the workflow run on optimal resources, it has
drawbacks: inaccuracies in the initial stage can cause failures and suboptimal
resource usage, and waiting for the initial processing to complete adds
overhead, which is critical for fast-turnaround analyses.
In this context, our study introduces a novel pipeline of machine learning
models within a comprehensive workflow management system, the Production and
Distributed Analysis (PanDA) system. These models employ advanced machine
learning techniques to predict key resource requirements, overcoming challenges
posed by limited upfront knowledge of characteristics at each step. Accurate
forecasts of resource requirements enable informed and proactive
decision-making in workflow management, enhancing the efficiency of handling
diverse, complex workflows across heterogeneous resources.
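As a concrete illustration of the prediction step, the Python sketch below
trains a regressor on submission-time task attributes to predict peak memory.
It is a minimal sketch under our own assumptions: the features, the synthetic
data, and the model choice (scikit-learn's GradientBoostingRegressor) are
illustrative stand-ins, not PanDA's actual pipeline.

# Hedged sketch: predict a task's peak memory from submission-time
# attributes. Features, data, and model are illustrative assumptions,
# not PanDA's implementation.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in for historical task profiles.
X = np.column_stack([
    rng.uniform(1, 500, n),              # input dataset size in GB
    rng.integers(1_000, 10_000_000, n),  # number of events
    rng.integers(0, 5, n),               # encoded analysis type
])
# Synthetic target: peak memory in MB, loosely tied to the features.
y = 2000 + 4 * X[:, 0] + 1e-4 * X[:, 1] + rng.normal(0, 200, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)
print(f"mean absolute error: {np.abs(pred - y_test).mean():.0f} MB")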
AI Insights
- PanDA now runs a full ML pipeline that predicts memory, CPU, I/O, and walltime with sub-second latency.
- 70% of tasks are predicted within 5% of actual usage, cutting idle time dramatically (a sketch of this accuracy metric follows this list).
- Future work includes clustering task attributes, adding domain knowledge, and a feedback loop for continuous model refinement.
- Transfer learning across diverse scientific workflows is proposed to generalize the models beyond the current dataset.
- The authors cite “pipecomp, a General Framework for the Evaluation of Computational Pipelines” and recommend “Robust Performance Metrics for Imbalanced Classification Problems” for deeper evaluation.
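The accuracy figure above is a within-tolerance rate. Below is a minimal
sketch of how such a metric can be computed; the function name and the toy
numbers are ours, not from the paper.

# Hedged sketch: fraction of tasks whose predicted resource usage falls
# within a relative tolerance of actual usage. Names/values are ours.
import numpy as np

def fraction_within(pred, actual, tol=0.05):
    # Share of predictions with relative error below tol.
    return float(np.mean(np.abs(pred - actual) / actual < tol))

# Toy example: 3 of 4 predictions land within 5% of actual usage.
print(fraction_within(np.array([98.0, 205.0, 310.0, 500.0]),
                      np.array([100.0, 200.0, 300.0, 400.0])))  # 0.75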
arXiv:2509.13436v1 [cs.SE]
Abstract
As research increasingly relies on computational methods, the reliability of
scientific results depends on the quality, reproducibility, and transparency of
research software. Ensuring these qualities is critical for scientific
integrity and discovery. This paper asks whether Research Software Science
(RSS)--the empirical study of how research software is developed and
used--should be considered a form of metascience, the science of science.
Classification matters because it could affect recognition, funding, and
integration of RSS into research improvement. We define metascience and RSS,
compare their principles and objectives, and examine their overlaps. Arguments
for classification highlight shared commitments to reproducibility,
transparency, and empirical study of research processes. Arguments against
portray RSS as a specialized domain focused on a tool rather than the broader
scientific enterprise. Our analysis finds that RSS advances core goals of
metascience, especially in computational reproducibility, and bridges
technical, social, and cognitive aspects of research. Its classification
depends on whether one adopts a broad definition of metascience--any empirical
effort to improve science--or a narrow one focused on systemic and
epistemological structures. We argue RSS is best understood as a distinct
interdisciplinary domain that aligns with, and in some definitions fits within,
metascience. Recognizing it as such can strengthen its role in improving
reliability, justify funding, and elevate software development in research
institutions. Regardless of classification, applying scientific rigor to
research software ensures the tools of discovery meet the standards of the
discoveries themselves.
AI Insights
- RSS adopts empirical methods akin to Empirical Software Engineering to quantify software quality metrics.
- The field’s core contribution is a reproducibility framework that maps software artifacts to experimental protocols.
- Literature such as Bennett’s An Introduction to Metascience and Mausfeld’s Epsilon‑Metascience contextualizes RSS within broader meta‑research debates.
- Ziemann et al.’s Five Pillars of Computational Reproducibility provides a practical checklist that RSS researchers routinely apply.
- The FORRT framework offers a training curriculum that integrates open‑source practices with rigorous reproducibility standards.
- RSS is positioned as an interdisciplinary bridge, linking cognitive science, sociology of science, and software engineering.
- Recognizing RSS as a distinct domain can unlock targeted funding streams and institutional support for research software development.