Sage Bionetworks, OregonHe
Abstract
Continuous and reliable access to curated biological data repositories is
indispensable for accelerating rigorous scientific inquiry and fostering
reproducible research. Centralized repositories, though widely used, are
vulnerable to single points of failure arising from cyberattacks, technical
faults, natural disasters, or funding and political uncertainty. Such
disruptions can cause widespread data unavailability, data loss, integrity
compromises, and substantial delays in critical research, ultimately impeding
scientific
progress. Centralizing essential scientific resources in a single geopolitical
or institutional hub is inherently dangerous, as any disruption can paralyze
diverse ongoing research. The rapid acceleration of data generation, combined
with an increasingly volatile global landscape, necessitates a critical
re-evaluation of the sustainability of centralized models. Implementing
federated and decentralized architectures presents a compelling and
future-oriented pathway to substantially strengthen the resilience of
scientific data infrastructures, thereby mitigating vulnerabilities and
ensuring the long-term integrity of data. Here, we examine the structural
limitations of centralized repositories, evaluate federated and decentralized
models, and propose a hybrid framework for resilient, FAIR, and sustainable
scientific data stewardship. Such an approach reduces exposure to governance
instability, infrastructural fragility, and funding volatility while also
fostering fairness and global accessibility. The future of
open science depends on integrating these complementary approaches to establish
a globally distributed, economically sustainable, and institutionally robust
infrastructure that safeguards scientific data as a public good, further
ensuring continued accessibility, interoperability, and preservation for
generations to come.
AI Insights
- EOSC’s federated nodes already host 1 million genomes, a living model of distributed stewardship.
- ELIXIR’s COVID‑19 response proved community pipelines can scale to pandemic‑grade data volumes.
- The Global Biodata Coalition’s roadmap envisions a cross‑border mesh that outpaces single‑point failure risks.
- DeSci employs blockchain provenance to give researchers immutable audit trails for every dataset.
- NIH’s Final Data Policy now mandates FAIR compliance, nudging institutions toward hybrid decentralized architectures.
- DeSci still struggles with interoperability, as heterogeneous metadata schemas block seamless cross‑platform queries.
- Privacy‑by‑design in distributed repositories remains a top research gap, inviting novel cryptographic solutions.
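The audit trails mentioned in the DeSci bullet rest on one core idea: each provenance record embeds the hash of the record before it, so any retroactive edit invalidates every later link. The sketch below is a minimal, self-contained illustration of that hash-chaining pattern; the `ProvenanceChain` class and its fields are hypothetical and do not correspond to any specific DeSci platform or blockchain API.

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    """Deterministic SHA-256 digest of a provenance record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

class ProvenanceChain:
    """Append-only, hash-linked audit trail for dataset events (illustrative)."""

    GENESIS = "0" * 64  # placeholder hash for the first record's predecessor

    def __init__(self):
        self.entries = []

    def append(self, dataset_id: str, action: str, actor: str) -> dict:
        # Link each new record to the hash of the previous entry.
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        record = {
            "dataset_id": dataset_id,
            "action": action,
            "actor": actor,
            "prev_hash": prev,
        }
        entry = {"record": record, "hash": record_hash(record)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and link; any tampering breaks the chain."""
        prev = self.GENESIS
        for entry in self.entries:
            if entry["record"]["prev_hash"] != prev:
                return False
            if record_hash(entry["record"]) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

In a real decentralized setting the chain head would be replicated or anchored on a shared ledger, so no single custodian can silently rewrite history; here, tampering with any stored record simply makes `verify()` return `False`.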
Abstract
LLMs promise to democratize technical work in complex domains like
programmatic data analysis, but not everyone benefits equally. We study how
students with varied expertise use LLMs to complete Python-based data analysis
in computational notebooks in a course for non-majors. Drawing on homework logs,
recordings, and surveys from 36 students, we ask: Which expertise matters most,
and how does it shape AI use? Our mixed-methods analysis shows that technical
expertise -- not AI familiarity or communication skills -- remains a
significant predictor of success. Students also vary widely in how they
leverage LLMs, struggling at stages of forming intent, expressing inputs,
interpreting outputs, and assessing results. We identify success and failure
behaviors, such as providing context or decomposing prompts, that distinguish
effective use. These findings inform AI literacy interventions, highlighting
that lightweight demonstrations improve surface fluency but are insufficient;
deeper training and scaffolds are needed to cultivate resilient AI use skills.