Brookhaven National Laboratory
Abstract
Large-scale international collaborations such as ATLAS rely on globally
distributed workflows and data management to process, move, and store vast
volumes of data. ATLAS's Production and Distributed Analysis (PanDA) workflow
system and the Rucio data management system are each highly optimized for their
respective design goals. However, operating them together at global scale
exposes systemic inefficiencies, including underutilized resources, redundant
or unnecessary transfers, and altered error distributions. Moreover, PanDA and
Rucio currently lack shared performance awareness and coordinated, adaptive
strategies.
This work charts a path toward co-optimizing the two systems by diagnosing
data-management pitfalls and prioritizing end-to-end improvements. With the
observation of spatially and temporally imbalanced transfer activity, we
develop a metadata-matching algorithm that links PanDA jobs and Rucio datasets
at the file level, yielding a complete, fine-grained view of data access and
movement. Using this linkage, we identify anomalous transfer patterns that
violate PanDA's data-centric job-allocation principle. We then outline
mitigation strategies for these patterns and highlight opportunities for
tighter PanDA-Rucio coordination to improve resource utilization, reduce
unnecessary data movement, and enhance overall system resilience.
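The file-level linkage described above can be thought of as a join between job input files and dataset contents. The following is a minimal illustrative sketch, not the actual PanDA/Rucio matching algorithm; all field names (`job_id`, `input_files`, `dataset`, `files`) and the sample records are hypothetical stand-ins for the real metadata:

```python
from collections import defaultdict

# Hypothetical minimal metadata: PanDA jobs list their input files (LFNs);
# Rucio datasets list their constituent files. Field names are illustrative,
# not actual PanDA or Rucio schema fields.
jobs = [
    {"job_id": 1, "site": "BNL", "input_files": ["f1", "f2"]},
    {"job_id": 2, "site": "CERN", "input_files": ["f3"]},
]
datasets = [
    {"dataset": "dsA", "files": ["f1", "f3"]},
    {"dataset": "dsB", "files": ["f2"]},
]

def link_jobs_to_datasets(jobs, datasets):
    """File-level join: map each file to its owning dataset,
    then collect the datasets each job actually touched."""
    file_to_ds = {}
    for ds in datasets:
        for f in ds["files"]:
            file_to_ds[f] = ds["dataset"]
    links = defaultdict(set)
    for job in jobs:
        for f in job["input_files"]:
            if f in file_to_ds:
                links[job["job_id"]].add(file_to_ds[f])
    return {j: sorted(ds_names) for j, ds_names in links.items()}

print(link_jobs_to_datasets(jobs, datasets))
# {1: ['dsA', 'dsB'], 2: ['dsA']}
```

Once jobs are linked to datasets this way, per-site access patterns can be aggregated to flag transfers that contradict data-centric job allocation (e.g. a job scheduled far from every dataset it reads).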
University College Cork
Abstract
This chapter presents a comprehensive taxonomy for assessing data quality in
the context of data monetisation, developed through a systematic literature
review. The taxonomy organises over one hundred metrics and Key Performance
Indicators (KPIs) into four subclusters (Fundamental, Contextual, Resolution,
and Specialised) within the Balanced Scorecard (BSC) framework, integrating
both universal and domain-specific quality dimensions. By
positioning data quality as a strategic connector across the BSC's Financial,
Customer, Internal Processes, and Learning & Growth perspectives, it
demonstrates how quality metrics underpin valuation accuracy, customer trust,
operational efficiency, and innovation capacity. The framework's interconnected
"metrics layer" ensures that improvements in one dimension cascade into others,
maximising strategic impact. This holistic approach bridges the gap between
granular technical assessment and high-level decision-making, offering
practitioners, data stewards, and strategists a scalable, evidence-based
reference for aligning data quality management with sustainable value creation.