Hi!

Your personalized paper recommendations for 12 to 16 January, 2026.

Automatic debiased machine learning and sensitivity analysis for sample selection models

Massachusetts Institute of Technology MIT

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The omitted variable bias (OVB) in the sample selection model with confounding in selection can be represented as E[(g0-gs)(α0-αs)], where g0 and gs are the long and short outcome regressions, and α0 and αs are the corresponding long and short Riesz representers. [3]
The sensitivity parameter C^2_S measures the share of variation in the long representer that is not captured by the short representer. [3]
Omitted variable bias (OVB): the difference between the true parameter and the estimated parameter due to omitted variables. [3]
Sensitivity parameter C^2_S: measures the share of variation in the long representer that is not captured by the short representer. [3]
The sensitivity parameter C^2_S measures the gain in precision from observing the unobserved confounder A. [3]
The terms 1/(p_d(X)π(·)) grow when either the treatment propensity p_d(X) or the selection probability π(·) is small, summarizing the overlap and selection difficulty through an average inverse-probability scale. [2]
The OVB admits a representation of |θ0-θs|2 = ρ^2B^2 ≤ B^2, with B^2 being identified from the observed data. [1]
The OVB can be represented as E[(g0-gs)(α0-αs)], and its magnitude is bounded by B^2. [0]

Abstract
In this paper, we extend the Riesz representation framework to causal inference under sample selection, where both treatment assignment and outcome observability are non-random. Formulating the problem in terms of a Riesz representer enables stable estimation and a transparent decomposition of omitted variable bias into three interpretable components: a data-identified scale factor, outcome confounding strength, and selection confounding strength. For estimation, we employ the ForestRiesz estimator, which accounts for selective outcome observability while avoiding the instability associated with direct propensity score inversion. We assess finite-sample performance through a simulation study and show that conventional double machine learning approaches can be highly sensitive to tuning parameters due to their reliance on inverse probability weighting, whereas the ForestRiesz estimator delivers more stable performance by leveraging automatic debiased machine learning. In an empirical application to the gender wage gap in the U.S., we find that our ForestRiesz approach yields larger treatment effect estimates than a standard double machine learning approach, suggesting that ignoring sample selection leads to an underestimation of the gender wage gap. Sensitivity analysis indicates that implausibly strong unobserved confounding would be required to overturn our results. Overall, our approach provides a unified, robust, and computationally attractive framework for causal inference under sample selection.

Why we are recommending this paper?
Due to your Interest in Data Bias

This paper tackles the crucial issue of bias in causal inference, aligning directly with the user's interest in data fairness and AI ethics. The use of a Riesz representer offers a transparent and stable approach to addressing selection bias, a key concern in many fairness applications.

Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing

University of Cambridge

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Stereotypical bias refers to the tendency of a model to perpetuate stereotypes or biases based on social categories such as gender, race, or ethnicity. [3]
Social bias refers to the tendency of a model to favor certain social groups over others, often based on implicit assumptions or cultural norms. [3]
Many researchers have proposed debiasing techniques for pre-trained language models. [2]

Abstract
Pre-trained language models (LMs) have, over the last few years, grown substantially in both societal adoption and training costs. This rapid growth in size has constrained progress in understanding and mitigating their biases. Since re-training LMs is prohibitively expensive, most debiasing work has focused on post-hoc or masking-based strategies, which often fail to address the underlying causes of bias. In this work, we seek to democratise pre-model debiasing research by using low-cost proxy models. Specifically, we investigate BabyLMs, compact BERT-like models trained on small and mutable corpora that can approximate bias acquisition and learning dynamics of larger models. We show that BabyLMs display closely aligned patterns of intrinsic bias formation and performance development compared to standard BERT models, despite their drastically reduced size. Furthermore, correlations between BabyLMs and BERT hold across multiple intra-model and post-model debiasing methods. Leveraging these similarities, we conduct pre-model debiasing experiments with BabyLMs, replicating prior findings and presenting new insights regarding the influence of gender imbalance and toxicity on bias formation. Our results demonstrate that BabyLMs can serve as an effective sandbox for large-scale LMs, reducing pre-training costs from over 500 GPU-hours to under 30 GPU-hours. This provides a way to democratise pre-model debiasing research and enables faster, more accessible exploration of methods for building fairer LMs.

Why we are recommending this paper?
Due to your Interest in AI Bias

Given the user's focus on AI bias and data representation, this paper’s investigation into biases within large language models is highly relevant. The proposed compute-efficient sandbox for debiasing directly addresses a significant challenge in mitigating bias in AI systems.

On the use of graph models to achieve individual and group fairness

Universidad Carlos III de Madrid

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

k-Nearest Neighbor (kNN) graph: A graph where two nodes are connected if they are among the k closest observations to each other according to a given metric. [3]
The text assumes a linear model is applied to the signal, which may not always be the case. [3]
The text draws inspiration from Opinion Dynamics, a concept in social network analysis. [3]
Signal aggregation and its implications on dataset analysis Imagine you're trying to understand a dataset with many points. [3]
The text discusses various topologies and their implications on signal aggregation in a dataset. [2]

Abstract
Machine Learning algorithms are ubiquitous in key decision-making contexts such as justice, healthcare and finance, which has spawned a great demand for fairness in these procedures. However, the theoretical properties of such models in relation with fairness are still poorly understood, and the intuition behind the relationship between group and individual fairness is still lacking. In this paper, we provide a theoretical framework based on Sheaf Diffusion to leverage tools based on dynamical systems and homology to model fairness. Concretely, the proposed method projects input data into a bias-free space that encodes fairness constrains, resulting in fair solutions. Furthermore, we present a collection of network topologies handling different fairness metrics, leading to a unified method capable of dealing with both individual and group bias. The resulting models have a layer of interpretability in the form of closed-form expressions for their SHAP values, consolidating their place in the responsible Artificial Intelligence landscape. Finally, these intuitions are tested on a simulation study and standard fairness benchmarks, where the proposed methods achieve satisfactory results. More concretely, the paper showcases the performance of the proposed models in terms of accuracy and fairness, studying available trade-offs on the Pareto frontier, checking the effects of changing the different hyper-parameters, and delving into the interpretation of its outputs.

Why we are recommending this paper?
Due to your Interest in Data Fairness

This work explores the application of graph models to fairness, a promising area given the user’s interest in fairness and data representation. The research provides a theoretical framework for understanding and achieving fairness within complex systems, aligning with the user's broader concerns.

Navigating Ethical AI Challenges in the Industrial Sector: Balancing Innovation and Responsibility

ABB AG

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Artificial intelligence (AI) is increasingly being used in various industries, including manufacturing, healthcare, and finance. [3]
The use of AI has several benefits, including improved efficiency, enhanced decision-making capabilities, and increased productivity. [3]
Artificial intelligence (AI): The development of computer systems that can perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making. [3]
Machine learning: A subset of AI that involves the use of algorithms to enable computers to learn from data without being explicitly programmed. [3]
Deep learning: A type of machine learning that uses neural networks with multiple layers to analyze data. [3]
The increasing adoption of AI is expected to have a significant impact on various industries and aspects of society. [3]
However, the use of AI also raises several concerns, including job displacement, bias in decision-making, and security risks. [2]

Abstract
The integration of artificial intelligence (AI) into the industrial sector has not only driven innovation but also expanded the ethical landscape, necessitating a reevaluation of principles governing technology and its applications and awareness in research and development of industrial AI solutions. This chapter explores how AI-empowered industrial innovation inherently intersects with ethics, as advancements in AI introduce new challenges related to transparency, accountability, and fairness. In the chapter, we then examine the ethical aspects of several examples of AI manifestation in industrial use cases and associated factors such as ethical practices in the research and development process and data sharing. With the progress of ethical industrial AI solutions, we emphasize the importance of embedding ethical principles into industrial AI systems and its potential to inspire technological breakthroughs and foster trust among stakeholders. This chapter also offers actionable insights to guide industrial research and development toward a future where AI serves as an enabler for ethical and responsible industrial progress as well as a more inclusive industrial ecosystem.

Why we are recommending this paper?
Due to your Interest in Data Ethics

Considering the user’s interest in AI ethics and transparency, this paper’s focus on the ethical considerations of AI in the industrial sector is a strong match. It directly addresses the responsibility associated with AI innovation, a core theme within the user's profile.

AI as Entertainment

The Alan Turing Institute

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The authors contend that entertainment is a significant use case for AI, with people already using AI for activities unrelated to productivity. [3]
The paper suggests that this vision should inspire more debates, discourse, and study in the field of AI, as generative AI is increasingly being used for entertainment. [3]
AS: Artificially generated content GenAI: Generative AI Sociotechnical systems: Complex systems that combine social and technical components The paper concludes by emphasizing the need for a constructive vision of cultural AI, rather than just harm minimization. [3]
The paper argues that mainstream approaches to evaluating AI systems tend to focus on intelligence and harm minimization, but neglect the cultural dimension of AI use. [2]
They propose developing a positive theory of what beneficial, nutritious entertainment might look like, rather than just mitigating harms. [0]

Abstract
Generative AI systems are predominantly designed, evaluated, and marketed as intelligent systems which will benefit society by augmenting or automating human cognitive labor, promising to increase personal, corporate, and macroeconomic productivity. But this mainstream narrative about what AI is and what it can do is in tension with another emerging use case: entertainment. We argue that the field of AI is unprepared to measure or respond to how the proliferation of entertaining AI-generated content will impact society. Emerging data suggest AI is already widely adopted for entertainment purposes -- especially by young people -- and represents a large potential source of revenue. We contend that entertainment will become a primary business model for major AI corporations seeking returns on massive infrastructure investments; this will exert a powerful influence on the technology these companies produce in the coming years. Examining current evaluation practices, we identify a critical asymmetry: while AI assessments rigorously measure both benefits and harms of intelligence, they focus almost exclusively on cultural harms. We lack frameworks for articulating how cultural outputs might be actively beneficial. Drawing on insights from the humanities, we propose "thick entertainment" as a framework for evaluating AI-generated cultural content -- one that considers entertainment's role in meaning-making, identity formation, and social connection rather than simply minimizing harm. While AI is often touted for its potential to revolutionize productivity, in the long run we may find that AI turns out to be as much about "intelligence" as social media is about social connection.

Why we are recommending this paper?
Due to your Interest in AI Bias

This paper’s critical examination of the narrative surrounding AI’s role in society – particularly its framing as entertainment – is highly relevant to the user’s interest in AI transparency and representation. It prompts reflection on the potential biases embedded within the dominant discourse about AI.

Intersectional Data and the Social Cost of Digital Extraction: A Pigouvian Surcharge

Universidad Pontificia Comillas

Rate paper: 👍 👎 ♥ Save

AI Insights

It is an institutional intervention aimed at correcting a specific mechanism of exploitation within digital capitalism: the unpriced extraction of informational power from socially embedded identities. [3]
What is at stake is not merely who agrees to share data, but who bears the cost of prediction and who accumulates its benefits. [3]
Pigouvian surcharge: A monetary charge imposed on firms for extracting informational power from socially embedded identities without remuneration or control. [3]
It is a deliberately partial intervention that targets a concrete mechanism of exploitation within digital capitalism. [3]
The paper argues that contemporary data markets systematically underprice the social harm generated by predictive inference over protected, intersectional identities. [2]
The framework reframes privacy from a matter of individual consent to a question of political economy. [1]

Abstract
Contemporary digital capitalism relies on the large-scale extraction and commodification of personal data. Far from revealing isolated attributes, such data increasingly exposes intersectional social identities formed by combinations of race, gender, disability and others. This process generates a structural privacy externality: while firms appropriate economic value through profiling, prediction, and personalization, individuals and social groups bear diffuse costs in the form of heightened social risk, discrimination, and vulnerability. This paper develops a formal political economic framework to internalize these externalities by linking data valuation to information-theoretic measures. We propose a pricing rule based on mutual information that assigns monetary value to the entropy reduction induced by individual data points over joint intersectional identity distributions. Interpreted as a Pigouvian-style surcharge on data extraction, this mechanism functions as an institutional constraint on the asymmetric accumulation of informational power. A key advantage of the approach is its model-agnostic character: the valuation rule operates independently of the statistical structure used to estimate intersectional attributes, whether parametric, nonparametric, or machine-learned, and can be approximated through discretization of joint distributions. We argue that regulators can calibrate this surcharge to reflect contested social values, thereby embedding normative judgments directly into market design. By formalizing the social cost of intersectional data extraction, the proposed mechanism offers both a corrective to market failure and a redistributive institutional shield for vulnerable groups under conditions of digital asymmetry.

Why we are recommending this paper?
Due to your Interest in Data Ethics

Mixtures of Transparent Local Models

Laval University

Rate paper: 👍 👎 ♥ Save

AI Insights

The proposed algorithm, PBλ(S), is a PAC-Bayesian learning algorithm that produces mixtures of transparent local linear classifiers. [3]
The algorithm is able to produce interpretable models while achieving high performance on synthetic data sets with known points of interest. [3]
The addition of unnecessary points of interest has a negligible impact on the classifier's performance, but increases the value of its risk bound core component. [3]
PAC-Bayesian learning algorithm: A type of machine learning algorithm that produces probabilistic models and provides bounds on their generalization error. [3]
Mixtures of transparent local linear classifiers: A type of model that combines multiple transparent local linear classifiers to produce a single, more accurate classifier. [3]
Deterministic predictors: Predictors that always output the same value for a given input. [3]
Risk bound core component: The minimum risk bound obtained by the PAC-Bayesian learning algorithm. [3]
PBλ(S) works by identifying interesting regions and producing local separating hyperplanes that are better adapted to them. [2]

Abstract
The predominance of machine learning models in many spheres of human activity has led to a growing demand for their transparency. The transparency of models makes it possible to discern some factors, such as security or non-discrimination. In this paper, we propose a mixture of transparent local models as an alternative solution for designing interpretable (or transparent) models. Our approach is designed for the situations where a simple and transparent function is suitable for modeling the label of instances in some localities/regions of the input space, but may change abruptly as we move from one locality to another. Consequently, the proposed algorithm is to learn both the transparent labeling function and the locality of the input space where the labeling function achieves a small risk in its assigned locality. By using a new multi-predictor (and multi-locality) loss function, we established rigorous PAC-Bayesian risk bounds for the case of binary linear classification problem and that of linear regression. In both cases, synthetic data sets were used to illustrate how the learning algorithms work. The results obtained from real data sets highlight the competitiveness of our approach compared to other existing methods as well as certain opaque models. Keywords: PAC-Bayes, risk bounds, local models, transparent models, mixtures of local transparent models.

Why we are recommending this paper?
Due to your Interest in Data Transparency

On the Leaky Private Information Retrieval with Side Information

Huawei

Rate paper: 👍 👎 ♥ Save

AI Insights

The download cost of the proposed scheme is given by D = 1 + 1/(N-1) * (1 - 1 / ((N-1)e^(-ε) + 1)^K-2). [3]
Download cost D: The total amount of data downloaded by the user to obtain the desired information. [3]
The proposed scheme satisfies the condition of ε-leaky (W,S)-privacy, which means that for any two pairs of demand and side information indices (W,S) and (W′,S′), the ratio Pr[Q[W,S]n = q|W,S] / Pr[Q[W,S]n = q|W′,S′] can take only three possible values: either 1, or e−ε, or eε. [1]

Abstract
This paper investigates the problem of leaky-private Private Information Retrieval with Side Information (L-PIR-SI), which relaxes the requirement of perfect privacy to achieve improved communication efficiency in the presence of side information. While the capacities of PIR-SI under both $W$-privacy and $(W,S)$-privacy have been partially explored, the impact of controlled information leakage in these settings remains unaddressed. We propose a unified probabilistic framework to construct L-PIR-SI schemes where the privacy leakage is quantified by a parameter $\varepsilon$, consistent with differential privacy standards. We characterize the achievable download costs and show that our results generalize several landmark results in the PIR literature: they recover the capacity of PIR-SI when $\varepsilon \to 0$, and reduce to the known bounds for leaky-PIR when side information is absent. This work provides the first look at the trade-offs between leakage, side information, and retrieval efficiency.

Why we are recommending this paper?
Due to your Interest in Data Transparency

Data-Induced Groupings and How To Find Them

University of Illinois UrbanaChampaign

Rate paper: 👍 👎 ♥ Save

AI Insights

The study explores how users rely on different features when identifying groups in a visualization, specifically when the x-axis is nominal. [3]
The researchers developed a predictive model that can identify data-induced groupings based on co-linearity and other features. [3]
The model achieved an F1 score of 0.99 for the Constellation study and 0.97 for the Contextual study. [3]
It can automatically evaluate different permutations of the x-axis and recommend those that minimize violations and adhere to hierarchy constraints. [3]
Data-induced grouping: A grouping of points in a visualization that is not based on any meaningful relationship between the data, but rather on the way the data is presented. [3]
Co-linearity: A feature used by the predictive model to identify groups based on the linear relationship between two or more variables. [3]
The results show that users tend to rely more on y-separation and co-linearity under the Contextual condition compared to the Constellation-based analysis. [2]

Abstract
Making sense of a visualization requires the reader to consider both the visualization design and the underlying data values. Existing work in the visualization community has largely considered affordances driven by visualization design elements, such as color or chart type, but how visual design interacts with data values to impact interpretation and reasoning has remained under-explored. Dot plots and bar graphs are commonly used to help users identify groups of points that form trends and clusters, but are liable to manifest groupings that are artifacts of spatial arrangement rather than inherent patterns in the data itself. These ``Data-induced Groups'' can drive suboptimal data comparisons and potentially lead the user to incorrect conclusions. We conduct two user studies using dot plots as a case study to understand the prevalence of data-induced groupings. We find that users rely on data-induced groupings in both conditions despite the fact that trend-based groupings are irrelevant in nominal data. Based on the study results, we build a model to predict whether users are likely to perceive a given set of dot plot points as a group. We discuss two use cases illustrating how the model can assist visualization designers by both diagnosing potential user-perceived groupings in dot plots and offering redesigns that better accentuate desired groupings through data rearrangement.

Why we are recommending this paper?
Due to your Interest in Data Representation

TiInsight: A SQL-based Automated Exploratory Data Analysis System through Large Language Models

PingCAP

Rate paper: 👍 👎 ♥ Save

AI Insights

HDC generation involves extracting representative entities for each database to facilitate efficient data exploration across multiple databases. [3]
It also includes a self-refinement chain to correct errors in generated SQL statements. [3]
The system demonstrates its capabilities through two real-world scenarios: the Financial dataset and the Bird dataset, showcasing its ability to provide insights and facilitate user-system interaction. [3]
HDC: Hierarchical Data Context - a summary of the data that includes a description, keywords, table information, and more. [3]
TiChart: Chart Selection - a component that selects the most suitable chart type to present analysis results by visualization. [3]
Exploration Efficiency: The ability of the system to efficiently explore data across multiple databases. [3]
TiInsight is a SQL-based automated cross-domain exploratory data analysis system that utilizes large language models to facilitate user-system interaction and provide powerful hierarchical data context (HDC) generation, text-to-SQL (TiSQL), chart selection (TiChart), and exploration efficiency. [2]
TiSQL is a schema filtering framework based on the map-reduce paradigm that filters tables and columns using clarified questions and cosine similarity. [1]

Abstract
The SQL-based exploratory data analysis has garnered significant attention within the data analysis community. The emergence of large language models (LLMs) has facilitated the paradigm shift from manual to automated data exploration. However, existing methods generally lack the ability for cross-domain analysis, and the exploration of LLMs capabilities remains insufficient. This paper presents TiInsight, an SQL-based automated cross-domain exploratory data analysis system. First, TiInsight offers a user-friendly GUI enabling users to explore data using natural language queries. Second, TiInsight offers a robust cross-domain exploratory data analysis pipeline: hierarchical data context (i.e., HDC) generation, question clarification and decomposition, text-to-SQL (i.e., TiSQL), and data visualization (i.e., TiChart). Third, we have implemented and deployed TiInsight in the production environment of PingCAP and demonstrated its capabilities using representative datasets. The demo video is available at https://youtu.be/JzYFyYd-emI.

Why we are recommending this paper?
Due to your Interest in Data Representation

Continuous Fairness On Data Streams

New Jersey Institute of Technology NJIT

Rate paper: 👍 👎 ♥ Save

AI Insights

It introduces two new algorithms: Monitor-BFair for preprocessing and BFair-ReOrder for reordering. [3]
The framework is evaluated on four real-world datasets, showing that it outperforms baselines by 2-3 orders of magnitude in terms of efficiency. [3]
It effectively enforces fairness at the block level, substantially increasing the percentage of fair blocks when BFair-ReOrder is applied. [3]
The paper proposes a framework for continuous fairness monitoring and reordering in data streams. [2]

Abstract
We study the problem of enforcing continuous group fairness over windows in data streams. We propose a novel fairness model that ensures group fairness at a finer granularity level (referred to as block) within each sliding window. This formulation is particularly useful when the window size is large, making it desirable to enforce fairness at a finer granularity. Within this framework, we address two key challenges: efficiently monitoring whether each sliding window satisfies block-level group fairness, and reordering the current window as effectively as possible when fairness is violated. To enable real-time monitoring, we design sketch-based data structures that maintain attribute distributions with minimal overhead. We also develop optimal, efficient algorithms for the reordering task, supported by rigorous theoretical guarantees. Our evaluation on four real-world streaming scenarios demonstrates the practical effectiveness of our approach. We achieve millisecond-level processing and a throughput of approximately 30,000 queries per second on average, depending on system parameters. The stream reordering algorithm improves block-level group fairness by up to 95% in certain cases, and by 50-60% on average across datasets. A qualitative study further highlights the advantages of block-level fairness compared to window-level fairness.

Why we are recommending this paper?
Due to your Interest in Data Fairness

Fairness risk and its privacy-enabled solution in AI-driven robotic applications

University of Groningen

Rate paper: 👍 👎 ♥ Save

AI Insights

Lipschitz constant (LA): a measure of how much the function g(u, x, a) changes with respect to A. [3]
The paper introduces a new fairness notion called g-fairness, which measures the difference in expected outcomes between two groups. [2]
It shows that if a mechanism is differentially private with respect to A, then the g-fairness metric is bounded by εA + log(1 + LA diam(A) + δA γ / τ), where LA is the Lipschitz constant with respect to A. [1]

Abstract
Complex decision-making by autonomous machines and algorithms could underpin the foundations of future society. Generative AI is emerging as a powerful engine for such transitions. However, we show that Generative AI-driven developments pose a critical pitfall: fairness concerns. In robotic applications, although intuitions about fairness are common, a precise and implementable definition that captures user utility and inherent data randomness is missing. Here we provide a utility-aware fairness metric for robotic decision making and analyze fairness jointly with user-data privacy, deriving conditions under which privacy budgets govern fairness metrics. This yields a unified framework that formalizes and quantifies fairness and its interplay with privacy, which is tested in a robot navigation task. In view of the fact that under legal requirements, most robotic systems will enforce user privacy, the approach shows surprisingly that such privacy budgets can be jointly used to meet fairness targets. Addressing fairness concerns in the creative combined consideration of privacy is a step towards ethical use of AI and strengthens trust in autonomous robots deployed in everyday environments.

Why we are recommending this paper?
Due to your Interest in AI Fairness

Interests not found

We did not find any papers that match the below interests. Try other terms also consider if the content exists in arxiv.org.

AI Transparency
AI Ethics

You can edit or add more interests any time.

💬 Help Shape Our Pricing

We're exploring pricing options to make this project sustainable. Take 3 minutes to share what you'd be willing to pay (if anything). Your input guides our future investment.

Share Your Feedback

Help us improve your experience!

This project is on its early stages your feedback can be pivotal on the future of the project. Let us know what you think about this week's papers and suggestions!

Give Feedback