Data Science Management

Several Issues Regarding Data Governance in AGI

Abstract
The rapid advancement of artificial intelligence has positioned data governance as a critical concern for responsible AI development. While frameworks exist for conventional AI systems, the potential emergence of Artificial General Intelligence (AGI) presents unprecedented governance challenges. This paper examines data governance challenges specific to AGI, defined as systems capable of recursive self-improvement or self-replication. We identify seven key issues that differentiate AGI governance from current approaches. First, AGI may autonomously determine what data to collect and how to use it, potentially circumventing existing consent mechanisms. Second, these systems may make data retention decisions based on internal optimization criteria rather than human-established principles. Third, AGI-to-AGI data sharing could occur at speeds and complexities beyond human oversight. Fourth, recursive self-improvement creates unique provenance tracking challenges, as systems evolve both themselves and how they process data. Fifth, ownership of data and insights generated through self-improvement raises complex intellectual property questions. Sixth, self-replicating AGI distributed across jurisdictions would create unprecedented challenges for enforcing data protection laws. Finally, governance frameworks established during early AGI development may quickly become obsolete as systems evolve. We conclude that effective AGI data governance requires built-in constraints, continuous monitoring mechanisms, dynamic governance structures, international coordination, and multi-stakeholder involvement. Without forward-looking governance approaches specifically designed for systems with autonomous data capabilities, we risk creating AGI whose relationship with data evolves in ways that undermine human values and interests.

August 16, 2025

Large Language Models in the Data Science Lifecycle: A Systematic Mapping Study

Abstract
In recent years, Large Language Models (LLMs) have emerged as transformative tools across numerous domains, impacting how professionals approach complex analytical tasks. This systematic mapping study comprehensively examines the application of LLMs throughout the Data Science lifecycle. By analyzing relevant papers from Scopus and IEEE databases, we identify and categorize the types of LLMs being applied, the specific stages and tasks of the data science process they address, and the methodological approaches used for their evaluation. Our analysis includes a detailed examination of evaluation metrics employed across studies and systematically documents both positive contributions and limitations of LLMs when applied to data science workflows. This mapping provides researchers and practitioners with a structured understanding of the current landscape, highlighting trends, gaps, and opportunities for future research in this rapidly evolving intersection of LLMs and data science.

August 12, 2025

Paid Search

ParallelSearch: Train your LLMs to Decompose Query and Search Sub-queries in Parallel with Reinforcement Learning

Abstract
Reasoning-augmented search agents such as Search-R1, trained via reinforcement learning with verifiable rewards (RLVR), demonstrate remarkable capabilities in multi-step information retrieval from external knowledge sources. These agents overcome the limitations of their parametric memory by dynamically gathering relevant facts for complex reasoning tasks. However, existing approaches suffer from a fundamental architectural limitation: they process search queries strictly sequentially, even when handling inherently parallelizable and logically independent comparisons. This sequential bottleneck significantly constrains computational efficiency, particularly for queries that require multiple entity comparisons. To address this limitation, we propose ParallelSearch, a novel reinforcement learning framework that empowers large language models (LLMs) to recognize parallelizable query structures and execute multiple search operations concurrently. Our approach introduces dedicated reward functions that incentivize the identification of independent query components while preserving answer accuracy by jointly considering correctness, query decomposition quality, and parallel execution benefits. Comprehensive experiments demonstrate that ParallelSearch outperforms state-of-the-art baselines by an average of 2.9% across seven question-answering benchmarks. Notably, on parallelizable questions, our method achieves a 12.7% performance improvement while requiring only 69.6% of the LLM calls compared to sequential approaches.
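
The difference between sequential and concurrent execution of independent sub-queries can be illustrated with a minimal sketch. It is not the ParallelSearch framework: the `search` coroutine is a hypothetical stand-in with simulated latency, and the decomposition into sub-queries is hard-coded here, whereas the paper trains an LLM to produce it via reinforcement learning.

```python
import asyncio
import time


async def search(query: str, latency: float = 0.05) -> str:
    """Hypothetical retrieval call; the sleep simulates network latency."""
    await asyncio.sleep(latency)
    return f"result for: {query}"


async def run_sequential(sub_queries):
    # Each search starts only after the previous one has finished.
    return [await search(q) for q in sub_queries]


async def run_parallel(sub_queries):
    # Independent sub-queries are issued together; gather preserves input order.
    return list(await asyncio.gather(*(search(q) for q in sub_queries)))


def timed(coro_fn, sub_queries):
    """Run a coroutine function to completion and return (results, seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(sub_queries))
    return results, time.perf_counter() - start


# Assumed decomposition of a comparison question into independent sub-queries.
sub_queries = [
    "height of mountain A",
    "height of mountain B",
    "height of mountain C",
]

seq_results, seq_time = timed(run_sequential, sub_queries)
par_results, par_time = timed(run_parallel, sub_queries)
print(f"sequential: {seq_time:.3f}s, parallel: {par_time:.3f}s")
```

Both variants return the same results in the same order; only the wall-clock time differs, which is the efficiency gap the abstract attributes to sequential processing.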

August 12, 2025

Personalization

TARA: Token-Aware LoRA for Composable Personalization in Diffusion Models

Abstract
Personalized text-to-image generation aims to synthesize novel images of a specific subject or style using only a few reference images. Recent methods based on Low-Rank Adaptation (LoRA) enable efficient single-concept customization by injecting lightweight, concept-specific adapters into pre-trained diffusion models. However, combining multiple LoRA modules for multi-concept generation often leads to identity missing and visual feature leakage. In this work, we identify two key issues behind these failures: (1) token-wise interference among different LoRA modules, and (2) spatial misalignment between the attention map of a rare token and its corresponding concept-specific region. To address these issues, we propose Token-Aware LoRA (TARA), which introduces a token mask to explicitly constrain each module to focus on its associated rare token to avoid interference, and a training objective that encourages the spatial attention of a rare token to align with its concept region. Our method enables training-free multi-concept composition by directly injecting multiple independently trained TARA modules at inference time. Experimental results demonstrate that TARA enables efficient multi-concept inference and effectively preserves the visual identity of each concept by avoiding mutual interference between LoRA modules. The code and models are available at https://github.com/YuqiPeng77/TARA.

August 12, 2025

Adaptive Personalized Conversational Information Retrieval

Abstract
Personalized conversational information retrieval (CIR) systems aim to satisfy users' complex information needs through multi-turn interactions by considering user profiles. However, not all search queries require personalization. The challenge lies in appropriately incorporating personalization elements into search when needed. Most existing studies implicitly incorporate users' personal information and conversational context using large language models without distinguishing the specific requirements for each query turn. Such a "one-size-fits-all" personalization strategy might lead to sub-optimal results. In this paper, we propose an adaptive personalization method, in which we first identify the required personalization level for a query and integrate personalized queries with other query reformulations to produce various enhanced queries. Then, we design a personalization-aware ranking fusion approach to assign fusion weights dynamically to different reformulated queries, depending on the required personalization level. The proposed adaptive personalized conversational information retrieval framework, APCIR, is evaluated on two TREC iKAT datasets. The results confirm the effectiveness of adaptive personalization: APCIR outperforms state-of-the-art methods.
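
The idea of weighting reformulated queries by the required personalization level can be sketched as follows. The abstract does not specify the fusion formula, so this example substitutes weighted reciprocal rank fusion as a stand-in; the `fusion_weights` mapping, the level scale in [0, 1], and the document IDs are all assumptions made for illustration.

```python
from collections import defaultdict


def fusion_weights(personalization_level: float) -> dict:
    """Hypothetical mapping from a personalization level in [0, 1] to weights."""
    return {
        "personalized": personalization_level,
        "generic": 1.0 - personalization_level,
    }


def weighted_rrf(rankings: dict, weights: dict, k: int = 60) -> list:
    """Fuse ranked lists with weighted reciprocal rank fusion (a stand-in)."""
    scores = defaultdict(float)
    for name, ranked_docs in rankings.items():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            scores[doc_id] += weights[name] / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Assumed rankings returned for two reformulations of the same query turn.
rankings = {
    "personalized": ["d3", "d1", "d2"],
    "generic": ["d1", "d2", "d3"],
}

high = weighted_rrf(rankings, fusion_weights(0.9))  # turn needs personalization
low = weighted_rrf(rankings, fusion_weights(0.1))   # turn needs little
print(high, low)
```

With a high level the fused list follows the personalized ranking, and with a low level it follows the generic one, which is the per-turn behaviour a fixed "one-size-fits-all" weighting cannot provide.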

August 12, 2025

Attribution

Passive Hack-Back Strategies for Cyber Attribution: Covert Vectors in Denied Environment

Abstract
Attributing cyberattacks remains a central challenge in modern cybersecurity, particularly within denied environments where defenders have limited visibility into attacker infrastructure and are restricted by legal or operational rules of engagement. This perspective examines the strategic value of passive hack-back techniques that enable covert attribution and intelligence collection without initiating direct offensive actions. Key vectors include tracking beacons, honeytokens, environment-specific payloads, and supply-chain-based traps embedded within exfiltrated or leaked assets. These approaches rely on the assumption that attackers will interact with compromised data in traceable ways, allowing defenders to gather signals without violating engagement policies. The paper also explores the role of Artificial Intelligence (AI) in enhancing passive hack-back operations. Topics include the deployment of autonomous agents for forensic reconnaissance, the use of Large Language Models (LLMs) to generate dynamic payloads, and Adversarial Machine Learning (AML) techniques for evasion and counter-deception. A dedicated section discusses the implications of quantum technologies in this context, both as future threats to cryptographic telemetry and as potential tools for stealthy communication and post-quantum resilience. Finally, the paper advocates for hybrid defensive frameworks that combine passive attribution with delayed or conditional active responses, while maintaining compliance with legal, ethical, and operational constraints.

August 17, 2025

Amazon Ads Multi-Touch Attribution

Abstract
Amazon's new Multi-Touch Attribution (MTA) solution allows advertisers to measure how each touchpoint across the marketing funnel contributes to a conversion. This gives advertisers a more comprehensive view of their Amazon Ads performance across objectives when multiple ads influence shopping decisions. Amazon MTA uses a combination of randomized controlled trials (RCTs) and machine learning (ML) models to allocate credit for Amazon conversions across Amazon Ads touchpoints in proportion to their value, i.e., their likely contribution to shopping decisions. ML models trained purely on observational data are easy to scale and can yield precise predictions, but the models might produce biased estimates of ad effects. RCTs yield unbiased ad effects but can be noisy. Our MTA methodology combines experiments, ML models, and Amazon's shopping signals in a thoughtful manner to inform attribution credit allocation.
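
Allocating credit "in proportion to value" can be shown with a short sketch. It covers only the final normalization step and assumes the per-touchpoint values have already been estimated; how Amazon MTA derives those values from RCTs and ML models is not described in enough detail here to reproduce, and the touchpoint names and numbers below are hypothetical.

```python
def allocate_credit(touchpoint_values: dict, conversions: float = 1.0) -> dict:
    """Split conversion credit across touchpoints in proportion to their value."""
    if any(v < 0 for v in touchpoint_values.values()):
        raise ValueError("touchpoint values must be non-negative")
    total = sum(touchpoint_values.values())
    if total <= 0:
        raise ValueError("at least one touchpoint must have positive value")
    return {
        name: conversions * value / total
        for name, value in touchpoint_values.items()
    }


# Hypothetical estimated contributions of three touchpoints to one conversion.
values = {"search_ad": 0.5, "display_ad": 0.3, "video_ad": 0.2}
credit = allocate_credit(values)
print(credit)
```

The credits always sum to the number of conversions being attributed, so no conversion is counted twice across touchpoints.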

August 11, 2025

Direction on Data Science Organizations

Advancing Data Equity: Practitioner Responsibility and Accountability in NLP Data Practices

Abstract
While research has focused on surfacing and auditing algorithmic bias to ensure equitable AI development, less is known about how NLP practitioners - those directly involved in dataset development, annotation, and deployment - perceive and navigate issues of NLP data equity. This study is among the first to center practitioners' perspectives, linking their experiences to a multi-scalar AI governance framework and advancing participatory recommendations that bridge technical, policy, and community domains. Drawing on a 2024 questionnaire and focus group, we examine how U.S.-based NLP data practitioners conceptualize fairness, contend with organizational and systemic constraints, and engage emerging governance efforts such as the U.S. AI Bill of Rights. Findings reveal persistent tensions between commercial objectives and equity commitments, alongside calls for more participatory and accountable data workflows. We critically engage debates on data diversity and diversity washing, arguing that improving NLP equity requires structural governance reforms that support practitioner agency and community consent.

August 13, 2025

Bidding

Optimal Boost Design for Auto-bidding Mechanism with Publisher Quality Constraints

Abstract
Online bidding is crucial in mobile ecosystems, enabling real-time ad allocation across billions of devices to optimize performance and user experience. Improving ad allocation efficiency is a long-standing research problem, as it directly enhances the economic outcomes for all participants in advertising platforms. This paper investigates the design of optimal boost factors in online bidding while incorporating quality value (the impact of displayed ads on publishers' long-term benefits). To address the divergent interests on quality, we establish a three-party auction framework with a unified welfare metric of advertiser and publisher. Within this framework, we derive the theoretical efficiency lower bound for C-competitive boost in second-price single-slot auctions, then design a novel quality-involved Boosting (q-Boost) algorithm for computing the optimal boost factor. Experimental validation on Alibaba's public dataset (AuctionNet) demonstrates 2%-6% welfare improvements over conventional approaches, proving our method's effectiveness in real-world settings.
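
How a boost can change the outcome of a second-price single-slot auction is illustrated below. This is a generic boosted auction, not the q-Boost algorithm: the additive form `bid + boost_factor * quality`, the `Bid` fields, and the example numbers are assumptions, and the sketch does not compute an optimal boost factor.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    advertiser: str
    bid: float
    quality: float  # hypothetical publisher-side quality value of the ad


def boosted_second_price(bids, boost_factor: float):
    """Rank by boosted score; the winner pays the lowest bid that still wins."""
    if len(bids) < 2:
        raise ValueError("need at least two bids")

    def score(b: Bid) -> float:
        return b.bid + boost_factor * b.quality

    ranked = sorted(bids, key=score, reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    # Threshold payment: runner-up's score minus the winner's own boost.
    payment = max(0.0, score(runner_up) - boost_factor * winner.quality)
    return winner.advertiser, payment


bids = [Bid("A", bid=1.0, quality=0.1), Bid("B", bid=0.9, quality=0.5)]

no_boost = boosted_second_price(bids, boost_factor=0.0)
with_boost = boosted_second_price(bids, boost_factor=1.0)
print(no_boost, with_boost)
```

Without a boost the highest bidder wins and pays the second-highest bid; with the boost the higher-quality ad wins the slot, which is the trade-off between advertiser value and publisher quality that the abstract's welfare metric is meant to capture.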

August 12, 2025

Expert-Guided Diffusion Planner for Auto-bidding

Abstract
Auto-bidding is extensively applied in advertising systems, serving a multitude of advertisers. Generative bidding is gradually gaining traction due to its robust planning capabilities and generalizability. In contrast to traditional reinforcement learning-based bidding, generative bidding does not rely on the Markov Decision Process (MDP), and it exhibits superior planning capabilities in long-horizon scenarios. Conditional diffusion modeling approaches have demonstrated significant potential in the realm of auto-bidding. However, relying solely on return as the optimality condition is too weak to guarantee the generation of genuinely optimal decision sequences, as it lacks personalized structural information. Moreover, diffusion models' t-step autoregressive generation mechanism inherently carries timeliness risks. To address these issues, we propose a novel conditional diffusion modeling method based on expert trajectory guidance combined with a skip-step sampling strategy to enhance generation efficiency. We have validated the effectiveness of this approach through extensive offline experiments and achieved statistically significant results in online A/B testing, with an increase of 11.29% in conversion and 12.35% in revenue compared with the baseline.

August 12, 2025

Marketing Channels

Adaptive Source-Channel Coding for Semantic Communications

Abstract
Semantic communications (SemComs) have emerged as a promising paradigm for joint data and task-oriented transmissions, combining the demands for both the bit-accurate delivery and end-to-end (E2E) distortion minimization. However, current joint source-channel coding (JSCC) in SemComs is not compatible with the existing communication systems and cannot adapt to the variations of the sources or the channels, while separate source-channel coding (SSCC) is suboptimal in the finite blocklength regime. To address these issues, we propose an adaptive source-channel coding (ASCC) scheme for SemComs over parallel Gaussian channels, where the deep neural network (DNN)-based semantic source coding and conventional digital channel coding are separately deployed and adaptively designed. To enable efficient adaptation between the source and channel coding, we first approximate the E2E data and semantic distortions as functions of source coding rate and bit error ratio (BER) via logistic regression, where BER is further modeled as functions of signal-to-noise ratio (SNR) and channel coding rate. Then, we formulate the weighted sum E2E distortion minimization problem for joint source-channel coding rate and power allocation over parallel channels, which is solved by the successive convex approximation. Finally, simulation results demonstrate that the proposed ASCC scheme outperforms typical deep JSCC and SSCC schemes for both the single- and parallel-channel scenarios while maintaining full compatibility with practical digital systems.

August 11, 2025

Language of Persuasion and Misrepresentation in Business Communication: A Textual Detection Approach

Abstract
The digitisation of business communication has reshaped persuasive discourse, enabling both greater transparency and more sophisticated deception. This inquiry synthesises classical rhetoric and communication psychology with linguistic theory and empirical studies in financial reporting, sustainability discourse, and digital marketing to explain how deceptive language can be systematically detected using persuasive lexicon. In controlled settings, detection accuracies of greater than 99% were achieved using computational textual analysis and personalised transformer models. However, reproducing this performance in multilingual settings remains difficult, largely because sufficient data are hard to find and few multilingual text-processing infrastructures are in place. This evidence shows a widening gap between theoretical representations of communication and those empirically approximated, and therefore a need for robust automatic text-identification systems as AI-based discourse becomes more realistic in communicating with humans.

August 13, 2025
