Department of Machine Learning, MBZUAI, Abu Dhabi, UAE
Abstract
Imagine decision-makers uploading data and, within minutes, receiving clear,
actionable insights delivered straight to their fingertips. That is the promise
of the AI Data Scientist, an autonomous Agent powered by large language models
(LLMs) that closes the gap between evidence and action. Rather than simply
writing code or responding to prompts, it reasons through questions, tests
ideas, and delivers end-to-end insights at a pace far beyond traditional
workflows. Guided by the scientific method's core tenet, the hypothesis, this Agent
uncovers explanatory patterns in data, evaluates their statistical
significance, and uses them to inform predictive modeling. It then translates
these results into recommendations that are both rigorous and accessible. At
the core of the AI Data Scientist is a team of specialized LLM Subagents, each
responsible for a distinct task such as data cleaning, statistical testing,
validation, and plain-language communication. These Subagents write their own
code, reason about causality, and identify when additional data is needed to
support sound conclusions. Together, they achieve in minutes what might
otherwise take days or weeks, enabling a new kind of interaction that makes
deep data science both accessible and actionable.
Department of Biostatistics, Yale School of Public Health, Yale University
Abstract
Data-driven decisions shape public health policies and practice, yet
persistent disparities in data representation skew insights and undermine
interventions. To address this, we advance a structured roadmap that integrates
public health data science with computer science and is grounded in
reflexivity. We adopt data equity as a guiding concept: ensuring the fair and
inclusive representation, collection, and use of data to prevent the
introduction or exacerbation of systemic biases that could lead to invalid
downstream inference and decisions. To underscore urgency, we present three
public health cases where non-representative datasets and skewed knowledge
impede decisions across diverse subgroups. These challenges echo themes in two
literatures: public health scholarship highlights gaps in high-quality data for
specific populations, while computer science and statistics contribute criteria
and metrics for diagnosing bias in data and models. Building on these foundations,
we propose a working definition of public health data equity and a structured
self-audit framework. Our framework integrates core computational principles
(fairness, accountability, transparency, ethics, privacy, confidentiality) with
key public health considerations (selection bias, representativeness,
generalizability, causality, information bias) to guide equitable practice
across the data life cycle, from study design and data collection to
measurement, analysis, interpretation, and translation. Embedding data equity
in routine practice offers a practical path for ensuring that data-driven
policies, artificial intelligence, and emerging technologies improve health
outcomes for all. Finally, we emphasize that, although data equity is an
essential first step, it does not by itself guarantee information, learning,
or decision equity.