Department of Machine Learning, MBZUAI, Abu Dhabi, UAE
Abstract
Imagine decision-makers uploading data and, within minutes, receiving clear,
actionable insights delivered straight to their fingertips. That is the promise
of the AI Data Scientist, an autonomous Agent powered by large language models
(LLMs) that closes the gap between evidence and action. Rather than simply
writing code or responding to prompts, it reasons through questions, tests
ideas, and delivers end-to-end insights at a pace far beyond traditional
workflows. Guided by the scientific method's core tenet, the hypothesis, this Agent
uncovers explanatory patterns in data, evaluates their statistical
significance, and uses them to inform predictive modeling. It then translates
these results into recommendations that are both rigorous and accessible. At
the core of the AI Data Scientist is a team of specialized LLM Subagents, each
responsible for a distinct task such as data cleaning, statistical testing,
validation, and plain-language communication. These Subagents write their own
code, reason about causality, and identify when additional data is needed to
support sound conclusions. Together, they achieve in minutes what might
otherwise take days or weeks, enabling a new kind of interaction that makes
deep data science both accessible and actionable.
Department of Biostatistics, Yale School of Public Health, Yale University
Abstract
Data-driven decisions shape public health policies and practice, yet
persistent disparities in data representation skew insights and undermine
interventions. To address this, we advance a structured roadmap that integrates
public health data science with computer science and is grounded in
reflexivity. We adopt data equity as a guiding concept: ensuring the fair and
inclusive representation, collection, and use of data to prevent the
introduction or exacerbation of systemic biases that could lead to invalid
downstream inference and decisions. To underscore urgency, we present three
public health cases where non-representative datasets and skewed knowledge
impede decisions across diverse subgroups. These challenges echo themes in two
literatures: public health scholarship highlights gaps in high-quality data for
specific populations, while computer science and statistics contribute criteria
and metrics for diagnosing bias in data and models. Building on these foundations,
we propose a working definition of public health data equity and a structured
self-audit framework. Our framework integrates core computational principles
(fairness, accountability, transparency, ethics, privacy, confidentiality) with
key public health considerations (selection bias, representativeness,
generalizability, causality, information bias) to guide equitable practice
across the data life cycle, from study design and data collection to
measurement, analysis, interpretation, and translation. Embedding data equity
in routine practice offers a practical path for ensuring that data-driven
policies, artificial intelligence, and emerging technologies improve health
outcomes for all. Finally, we emphasize that, although data equity is an
essential first step, it does not by itself guarantee information, learning,
or decision equity.