Department of Biostatistics, Yale School of Public Health, Yale University
Abstract
Data-driven decisions shape public health policies and practice, yet
persistent disparities in data representation skew insights and undermine
interventions. To address this, we advance a structured roadmap that integrates
public health data science with computer science and is grounded in
reflexivity. We adopt data equity as a guiding concept: ensuring the fair and
inclusive representation, collection, and use of data to prevent the
introduction or exacerbation of systemic biases that could lead to invalid
downstream inference and decisions. To underscore urgency, we present three
public health cases where non-representative datasets and skewed knowledge
impede decisions across diverse subgroups. These challenges echo themes in two
literatures: public health research highlights gaps in high-quality data for
specific populations, while computer science and statistics contribute criteria
and metrics for diagnosing bias in data and models. Building on these foundations,
we propose a working definition of public health data equity and a structured
self-audit framework. Our framework integrates core computational principles
(fairness, accountability, transparency, ethics, privacy, confidentiality) with
key public health considerations (selection bias, representativeness,
generalizability, causality, information bias) to guide equitable practice
across the data life cycle, from study design and data collection to
measurement, analysis, interpretation, and translation. Embedding data equity
in routine practice offers a practical path for ensuring that data-driven
policies, artificial intelligence, and emerging technologies improve health
outcomes for all. Finally, we emphasize that, although data equity is an
essential first step, it does not by itself guarantee information, learning, or
decision equity.