Princeton University, CIS
Abstract
The meteoric rise of Artificial Intelligence (AI), with its rapidly expanding
market capitalization, presents both transformative opportunities and critical
challenges. Chief among these is the urgent need for a new, unified paradigm
for trustworthy evaluation, as current benchmarks increasingly reveal critical
vulnerabilities. Issues like data contamination and selective reporting by
model developers fuel hype, while inadequate data quality control can lead to
biased evaluations that, even if unintentionally, may favor specific
approaches. As a flood of participants enters the AI space, this "Wild West" of
assessment makes distinguishing genuine progress from exaggerated claims
exceptionally difficult. Such ambiguity blurs scientific signals and erodes
public confidence, much as unchecked claims would destabilize financial markets
reliant on credible oversight from agencies like Moody's.
In high-stakes human examinations (e.g., SAT, GRE), substantial effort is
devoted to ensuring fairness and credibility; why settle for less in evaluating
AI, especially given its profound societal impact? This position paper argues
that the current laissez-faire approach is unsustainable. We contend that true,
sustainable AI advancement demands a paradigm shift: a unified, live, and
quality-controlled benchmarking framework robust by construction, not by mere
courtesy and goodwill. To this end, we dissect the systemic flaws undermining
today's AI evaluation, distill the essential requirements for a new generation
of assessments, and introduce PeerBench, a community-governed, proctored
evaluation blueprint that embodies this paradigm through sealed execution, item
banking with rolling renewal, and delayed transparency. Our goal is to pave the
way for evaluations that can restore integrity and deliver genuinely
trustworthy measures of AI progress.
AI Insights
- PeerBench introduces sealed execution, item banking with rolling renewal, and delayed transparency to curb data contamination (see the sketch after this list).
- The community-governed, proctored framework ensures benchmark items are refreshed continuously, preventing stale or biased tests.
- Holistic evaluation, defined as assessing fairness, safety, and interpretability together, is essential for trustworthy AI progress.
- Data contamination, the intentional or accidental leakage of benchmark data into model training sets, remains a critical vulnerability in LLM benchmarks.
- "Can We Trust AI Benchmarks?" and "BetterBench: Assessing AI Benchmarks" provide actionable best-practice guidelines for researchers.
- The video "AI as a Sport: On the Competitive Epistemologies of Benchmarking" illustrates how current tests resemble a zero-sum game.
- The paper's call for a unified, live benchmarking ecosystem echoes the rigor of SAT/GRE fairness audits, promising renewed public confidence.
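To make the item-banking bullet above concrete, the following is a minimal Python sketch of how an item bank with rolling renewal and delayed transparency could operate. It is not the PeerBench implementation: the names (ItemBank, BenchmarkItem), the usage cap, and the embargo period are illustrative assumptions, and sealed, proctored execution of the model itself is out of scope here.

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class BenchmarkItem:
    # A single benchmark question kept sealed until its public release.
    item_id: str
    prompt: str
    answer: str
    added_on: datetime
    times_used: int = 0

class ItemBank:
    # Hypothetical item bank: items are served privately, retired after a
    # usage cap, and only published (delayed transparency) after an embargo.
    def __init__(self, usage_cap: int = 3, embargo: timedelta = timedelta(days=180)):
        self.usage_cap = usage_cap
        self.embargo = embargo
        self._active: List[BenchmarkItem] = []
        self._retired: List[BenchmarkItem] = []

    def add_items(self, items: List[BenchmarkItem]) -> None:
        # Rolling renewal: freshly authored, community-contributed items keep entering.
        self._active.extend(items)

    def draw_for_evaluation(self, n: int) -> List[BenchmarkItem]:
        # Serve sealed items; any item that hits the usage cap is retired.
        drawn = self._active[:n]
        for item in drawn:
            item.times_used += 1
        self._retired.extend(i for i in drawn if i.times_used >= self.usage_cap)
        self._active = [i for i in self._active if i.times_used < self.usage_cap]
        return drawn

    def publishable(self, now: datetime) -> List[BenchmarkItem]:
        # Delayed transparency: retired items become public only after the embargo.
        return [i for i in self._retired if now - i.added_on >= self.embargo]

A real deployment would wrap draw_for_evaluation in proctored, sealed execution so that neither the items nor the model outputs can leak before the embargo expires.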
Abstract
With the rapid progress of Large Language Models (LLMs), the general public
now has easy and affordable access to applications capable of answering most
health-related questions in a personalized manner. These LLMs are proving
increasingly competitive with medical professionals and, in some capabilities,
now even surpass them. They hold particular promise in low-resource settings,
considering they provide the possibility of widely accessible, quasi-free
healthcare support. However, the evaluations that fuel these motivations
largely lack insight into the social nature of healthcare: they remain
oblivious to health disparities between social groups and to how bias may
translate into LLM-generated medical advice and affect users. We provide an exploratory
analysis of LLM answers to a series of medical questions spanning key clinical
domains, where we simulate these questions being asked by several patient
profiles that vary in sex, age range, and ethnicity. By comparing natural
language features of the generated responses, we show that, when LLMs are used
for medical advice, the responses they generate systematically differ between
social groups. In particular, Indigenous and intersex patients
receive advice that is less readable and more complex. We observe that these
trends are amplified when intersectional groups are considered. Given the
increasing trust individuals place in these models, we argue for greater AI
literacy and for urgent investigation and mitigation by AI developers to ensure
these systematic differences are reduced and do not translate into unjust
patient support. Our code is publicly available on GitHub.
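As an illustration of the kind of comparison described above (not the authors' released code), the following Python sketch asks a model the same clinical question under varying patient profiles and compares one readability proxy, Flesch Reading Ease, across groups. The query_llm helper, the profile wording, the example question, and the attribute lists are hypothetical placeholders.

import re
from itertools import product

def query_llm(prompt: str) -> str:
    # Placeholder for the model under study; swap in a real chat-completion call.
    return "You should seek medical attention promptly and consult a clinician."

def count_syllables(word: str) -> int:
    # Rough vowel-group heuristic; adequate for comparing groups, not absolute scores.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Standard Flesch Reading Ease: higher values indicate more readable text.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

# Illustrative attributes; the study varies sex, age range, and ethnicity.
SEXES = ["female", "male", "intersex"]
AGE_RANGES = ["25-34", "65-74"]
ETHNICITIES = ["Indigenous", "White", "Black", "Asian"]
QUESTION = "What should I do about persistent chest pain?"  # example clinical question

scores = {}
for sex, age, ethnicity in product(SEXES, AGE_RANGES, ETHNICITIES):
    profile = f"I am a {ethnicity} {sex} patient aged {age}."
    answer = query_llm(f"{profile} {QUESTION}")
    scores[(sex, age, ethnicity)] = flesch_reading_ease(answer)

# Lower readability scores for a group flag systematically more complex advice.
for group, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(group, round(score, 1))

A fuller analysis would average over many questions and model samples per profile and test group differences statistically, but the group-wise comparison above captures the core of the setup.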