Hi j34nc4rl0+ranking,

Here are your personalized paper recommendations, sorted by relevance.
Search
Kuaishou Technology
Abstract
Traditional e-commerce search systems employ multi-stage cascading architectures (MCA) that progressively filter items through recall, pre-ranking, and ranking stages. While effective at balancing computational efficiency with business conversion, these systems suffer from fragmented computation and optimization objective collisions across stages, which ultimately limit their performance ceiling. To address these issues, we propose OneSearch, the first industrially deployed end-to-end generative framework for e-commerce search. This framework introduces three key innovations: (1) a Keyword-enhanced Hierarchical Quantization Encoding (KHQE) module that preserves both hierarchical semantics and distinctive item attributes while maintaining strong query-item relevance constraints; (2) a multi-view user behavior sequence injection strategy that constructs behavior-driven user IDs and incorporates both explicit short-term and implicit long-term sequences to model user preferences comprehensively; and (3) a Preference-Aware Reward System (PARS) featuring multi-stage supervised fine-tuning and adaptive reward-weighted ranking to capture fine-grained user preferences. Extensive offline evaluations on large-scale industry datasets demonstrate OneSearch's superior performance for high-quality recall and ranking. Rigorous online A/B tests confirm its ability to enhance relevance in the same exposure position, achieving statistically significant improvements: +1.67% item CTR, +2.40% buyers, and +3.22% order volume. Furthermore, OneSearch reduces operational expenditure by 75.40% and improves Model FLOPs Utilization from 3.26% to 27.32%. The system has been successfully deployed across multiple search scenarios in Kuaishou, serving millions of users and generating tens of millions of PVs daily.
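The abstract does not spell out how KHQE builds its hierarchical codes, but the standard recipe for semantic item IDs is residual quantization. Below is a minimal Python sketch under that assumption; codebook sizes, depth, and all names are illustrative, not OneSearch's actual configuration.

# Hedged sketch: hierarchical (residual) quantization for semantic item IDs.
# Assumption: KHQE-like codes come from k-means codebooks fit on residuals.
import numpy as np
from sklearn.cluster import KMeans

def fit_residual_codebooks(item_embs, levels=3, codebook_size=256, seed=0):
    """Fit one k-means codebook per level on the residuals of the level above."""
    residual, codebooks = item_embs.copy(), []
    for _ in range(levels):
        km = KMeans(n_clusters=codebook_size, n_init=4, random_state=seed).fit(residual)
        codebooks.append(km.cluster_centers_)
        residual = residual - km.cluster_centers_[km.labels_]  # quantization error
    return codebooks

def encode(item_emb, codebooks):
    """Map one embedding to a coarse-to-fine code; shared prefixes = shared semantics."""
    residual, code = item_emb.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        code.append(idx)
        residual -= cb[idx]
    return tuple(code)

embs = np.random.randn(10_000, 64).astype(np.float32)  # toy item embeddings
print(encode(embs[0], fit_residual_codebooks(embs)))   # e.g. (17, 203, 5)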
AI Insights
  • Generative models now dominate recommendation pipelines, boosting relevance while slashing inference cost.
  • Large language models are blended with collaborative filtering to surface deeper user intent beyond clicks.
  • Contrastive learning is being replaced by data‑augmentation tricks to learn sequential preferences without explicit negatives.
  • “Transformer Memory as a Differentiable Search Index” shows memory‑augmented transformers can serve as fast, trainable retrieval back‑ends.
  • “Neural Discrete Representation Learning” compresses item embeddings into discrete codes, enabling efficient end‑to‑end generative search.
  • Generative model: learns to generate new samples resembling training data; contrastive learning: pulls similar pairs together and pushes dissimilar ones apart.
September 03, 2025
Hong Kong University of
Abstract
Query spelling correction is an important function of modern search engines since it effectively helps users express their intentions clearly. With the growing popularity of speech search driven by Automated Speech Recognition (ASR) systems, this paper introduces a novel method named Contextualized Token Discrimination (CTD) to conduct effective speech query correction. In CTD, we first employ BERT to generate token-level contextualized representations and then construct a composition layer to enhance semantic information. Finally, we produce the correct query according to the aggregated token representation, correcting the incorrect tokens by comparing the original token representations and the contextualized representations. Extensive experiments demonstrate the superior performance of our proposed method across all metrics, and we further present a new benchmark dataset with erroneous ASR transcriptions to offer comprehensive evaluations for audio query correction.
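As a rough illustration of the discrimination step, the sketch below flags tokens whose contextualized BERT representation disagrees most with their static input embedding; the paper's composition layer and correction head are not reproduced, and the cutoff is an arbitrary assumption.

# Hedged sketch: flag likely ASR errors as tokens whose contextualized BERT
# output disagrees most with their static input embedding. The paper's
# composition layer and correction head are omitted; the cutoff is arbitrary.
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

query = "play the whether forecast for tomorrow"   # "whether" is the ASR error
inputs = tok(query, return_tensors="pt")

with torch.no_grad():
    contextual = model(**inputs).last_hidden_state[0]                  # (T, 768)
    static = model.embeddings.word_embeddings(inputs["input_ids"])[0]  # (T, 768)

agreement = torch.cosine_similarity(contextual, static, dim=-1)
cutoff = agreement.mean() - agreement.std()
for token, a in zip(tok.convert_ids_to_tokens(inputs["input_ids"][0]), agreement):
    print(f"{token:12s} {a.item():.3f}" + ("  <- suspicious" if a < cutoff else ""))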
AI Insights
  • Large language models routinely exceed 90% accuracy on ASR error‑correction benchmarks, beating rule‑based baselines.
  • Fine‑tuning a pre‑trained transformer on a modest ASR corpus yields 5–10% gains over training from scratch.
  • Wav2vec‑2.0 and similar self‑supervised encoders now back most state‑of‑the‑art ASR pipelines, using contextualized embeddings that encode token meaning from surrounding audio.
  • Their computational footprint still limits low‑latency deployment on edge devices.
  • Domain mismatch remains a challenge; adaptive fine‑tuning shows promise for cross‑dialect robustness.
  • “Attention Is All You Need” introduced the transformer that underlies modern ASR and correction models.
  • “Deep Context: End‑to‑End Contextual Speech Recognition” shows end‑to‑end contextual modeling can replace hand‑crafted language models.
September 04, 2025
Personalization
Huazhong University of
Abstract
With the dynamic evolution of user interests and the increasing multimodal demands of internet applications, personalized content generation strategies based on static interest preferences struggle to meet practical requirements. The proposed TIMGen (Temporal Interest-driven Multimodal Generation) model addresses this challenge by modeling the long-term temporal evolution of users' interests and capturing dynamic interest representations with strong temporal dependencies. The model also supports the fusion of multimodal features, such as text, images, video, and audio, and delivers customized content based on multimodal preferences. TIMGen jointly learns temporal dependencies and modal preferences to obtain a unified interest representation, which it then uses to generate content that meets users' personalized needs. By overcoming the shortcomings of personalized recommendation methods based on static preferences, TIMGen enables flexible and dynamic modeling of users' multimodal interests, better capturing their preferences. It can be extended to a variety of practical application scenarios, including e-commerce, advertising, online education, and precision medicine, providing insights for future research.
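A minimal sketch of the two mechanisms the abstract emphasizes: bucketized timestamp embeddings for interest drift and attention-derived modality weights. Dimensions, fusion order, and the downstream generator are assumptions, not TIMGen's published architecture.

# Hedged sketch: timestamp embeddings plus attention over modality features.
import torch
import torch.nn as nn

class TemporalInterestEncoder(nn.Module):
    def __init__(self, d=128, n_modalities=4, n_time_buckets=512):
        super().__init__()
        self.time_emb = nn.Embedding(n_time_buckets, d)   # bucketized timestamps
        self.modality_proj = nn.ModuleList(nn.Linear(d, d) for _ in range(n_modalities))
        self.modality_attn = nn.Linear(d, 1)              # scores each modality
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, modal_feats, time_buckets):
        # modal_feats: (B, T, M, d) features per interaction and modality
        # time_buckets: (B, T) integer time buckets, e.g. weeks since signup
        fused = torch.stack([proj(modal_feats[:, :, m])
                             for m, proj in enumerate(self.modality_proj)], dim=2)
        w = torch.softmax(self.modality_attn(fused), dim=2)   # modality preference
        x = (w * fused).sum(dim=2) + self.time_emb(time_buckets)
        return self.encoder(x)[:, -1]                     # unified interest vector

enc = TemporalInterestEncoder()
out = enc(torch.randn(2, 10, 4, 128), torch.randint(0, 512, (2, 10)))
print(out.shape)   # torch.Size([2, 128])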
AI Insights
  • TIMGen’s Transformer embeds timestamps, enabling trend‑aware interest drift detection.
  • Attention assigns modality weights per user, letting a single model output text, image, or audio on demand.
  • Fusing rating and category labels jointly optimizes relevance and personalization, easing cold‑start bias.
  • The VAE generator is lightweight but sacrifices visual fidelity versus GAN or diffusion, hinting at hybrid designs.
  • Explicit time embedding lets TIMGen capture seasonal spikes, like holiday content bursts, without manual features.
  • Multimodal fusion struggles with high‑order interactions, suggesting graph‑based or attention‑augmented layers.
September 04, 2025
Keio University, NVIDIA
Abstract
Evaluating concept customization is challenging, as it requires a comprehensive assessment of fidelity to generative prompts and concept images. Moreover, evaluating multiple concepts is considerably more difficult than evaluating a single concept, as it demands detailed assessment not only for each individual concept but also for the interactions among concepts. While humans can intuitively assess generated images, existing metrics often provide either overly narrow or overly generalized evaluations, resulting in misalignment with human preference. To address this, we propose Decomposed GPT Score (D-GPTScore), a novel human-aligned evaluation method that decomposes evaluation criteria into finer aspects and incorporates aspect-wise assessments using Multimodal Large Language Model (MLLM). Additionally, we release Human Preference-Aligned Concept Customization Benchmark (CC-AlignBench), a benchmark dataset containing both single- and multi-concept tasks, enabling stage-wise evaluation across a wide difficulty range -- from individual actions to multi-person interactions. Our method significantly outperforms existing approaches on this benchmark, exhibiting higher correlation with human preferences. This work establishes a new standard for evaluating concept customization and highlights key challenges for future research. The benchmark and associated materials are available at https://github.com/ReinaIshikawa/D-GPTScore.
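The core recipe is easy to sketch: decompose the evaluation into aspects, ask an MLLM to score each one, and aggregate. The aspect names and prompt below are illustrative, and ask_mllm is a mock placeholder; the authors' actual prompts and aspect taxonomy are in the linked repository.

# Hedged sketch of the decomposition idea behind D-GPTScore.
from statistics import mean

ASPECTS = [
    "fidelity to each concept image",
    "fidelity to the text prompt",
    "naturalness of interactions between concepts",
]

def ask_mllm(image_path: str, question: str) -> float:
    """Stand-in for a real multimodal-LLM call that parses a 1-5 rating.
    Returns a fixed mock score so the sketch runs end to end."""
    return 3.0

def d_gpt_score(image_path: str, prompt: str) -> float:
    scores = [ask_mllm(image_path,
                       f"Rate the image from 1 (poor) to 5 (excellent) on: {aspect}. "
                       f"Prompt: '{prompt}'. Answer with a single number.")
              for aspect in ASPECTS]
    return mean(scores)   # a weighted mean is a natural extension

print(d_gpt_score("generated.png", "two friends playing chess in a park"))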
AI Insights
  • D‑GPTScore splits evaluation into fidelity, diversity, and interaction consistency, enabling fine‑grained analysis.
  • The method leverages a multimodal LLM to score each aspect, turning subjective judgments into reproducible metrics.
  • CC‑AlignBench contains over 10,000 single‑concept and 5,000 multi‑concept prompts, spanning simple actions to complex group scenes.
  • Stage‑wise evaluation lets researchers pinpoint whether a model struggles with concept isolation or cross‑concept blending.
  • Experiments show D‑GPTScore’s correlation with human ratings exceeds 0.8, surpassing prior metrics by a wide margin.
  • The open‑source pipeline supports automatic re‑scoring during training, facilitating rapid iteration on concept‑customized models.
  • Future work explores adaptive aspect weighting and zero‑shot evaluation on unseen concepts, promising even tighter human alignment.
September 03, 2025
Ranking
University of Augsburg
Abstract
The rank aggregation problem, which has many real-world applications, refers to the process of combining multiple input rankings into a single aggregated ranking. In dynamic settings, where new rankings arrive over time, efficiently updating the aggregated ranking is essential. This paper develops a fast, theoretically and practically efficient dynamic rank aggregation algorithm. First, we develop the LR-Aggregation algorithm, built on top of the LR-tree data structure, which is itself modeled on the LR-distance, a novel and equivalent take on the classical Spearman's footrule distance. We then analyze the theoretical efficiency of the Pick-A-Perm algorithm, and show how it can be combined with the LR-Aggregation algorithm using another data structure that we develop. We demonstrate through experimental evaluations that LR-Aggregation produces close to optimal solutions in practice. We show that Pick-A-Perm has a theoretical worst-case approximation guarantee of 2. We also show that both the LR-Aggregation and Pick-A-Perm algorithms, as well as the methodology for combining them, can be run in O(n log n) time. To the best of our knowledge, this is the first fast, near-linear-time rank aggregation algorithm in the dynamic setting having both a theoretical approximation guarantee and excellent practical performance (much better than the theoretical guarantee).
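For intuition, here is a static Python sketch of the footrule distance and a Pick-A-Perm-style aggregator that returns the best input ranking, which costs at most twice the optimum by the triangle inequality. The LR-tree machinery that makes this dynamic in O(n log n) time is the paper's contribution and is not reproduced here.

# Hedged static sketch: Spearman footrule distance plus Pick-A-Perm-style choice.
def footrule(r1, r2):
    """Sum over items of |position in r1 - position in r2|."""
    pos2 = {item: k for k, item in enumerate(r2)}
    return sum(abs(k - pos2[item]) for k, item in enumerate(r1))

def pick_a_perm(rankings):
    """Use one of the inputs itself as the aggregate ranking: the best such
    choice is a 2-approximation under the footrule distance."""
    return min(rankings, key=lambda r: sum(footrule(r, s) for s in rankings))

votes = [["a", "b", "c", "d"], ["b", "a", "c", "d"], ["a", "c", "b", "d"]]
print(pick_a_perm(votes))   # ['a', 'b', 'c', 'd']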
AI Insights
  • Borda count assigns points inversely proportional to rank position, making it resilient to minor rank shifts.
  • Pairwise comparison aggregates by majority preference between each pair, yielding a Condorcet‑consistent ranking when it exists.
  • Dynamic aggregation of streaming gene lists adapts to new experiments with negligible recomputation, as shown in Wang et al. 2022.
  • Robust variants like median rank or trimmed Borda mitigate outlier influence, addressing weaknesses highlighted in the paper.
  • Wang et al.'s 2024 survey offers a taxonomy of aggregation methods across domains, from social choice to bioinformatics.
  • Teng et al. 2018 present a voting aggregation algorithm that optimizes social satisfaction under cardinal utilities, a useful benchmark.
September 02, 2025
Princeton University
Abstract
This paper studies human preference learning based on partially revealed choice behavior and formulates the problem as a generalized Bradley-Terry-Luce (BTL) ranking model that accounts for heterogeneous preferences. Specifically, we assume that each user is associated with a nonparametric preference function, and each item is characterized by a low-dimensional latent feature vector; their interaction defines the underlying low-rank score matrix. In this formulation, we propose an indirect regularization method for collaboratively learning the score matrix, which ensures entrywise ℓ∞-norm error control, a novel contribution to the heterogeneous preference learning literature. This technique is based on sieve approximation and can be extended to a broader class of binary choice models where a smooth link function is adopted. In addition, by applying a single step of the Newton-Raphson method, we debias the regularized estimator and establish uncertainty quantification for item scores and rankings of items, both for aggregated and individual preferences. Extensive simulation results from synthetic and real datasets corroborate our theoretical findings.
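A minimal simulation of the underlying model, assuming the score matrix factors as S = U Vᵀ and that user u prefers item i over j with probability sigmoid(S[u,i] - S[u,j]). Plain gradient descent on the logistic loss stands in for the paper's method; the sieve approximation, indirect regularization, and Newton-Raphson debiasing are not reproduced.

# Hedged sketch: generalized BTL with a low-rank score matrix.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, rank = 50, 30, 3
S_true = rng.normal(size=(n_users, rank)) @ rng.normal(size=(rank, n_items))

u = rng.integers(n_users, size=5000)            # who compared ...
i, j = rng.integers(n_items, size=(2, 5000))    # ... which two items
y = rng.random(5000) < 1 / (1 + np.exp(S_true[u, j] - S_true[u, i]))  # i beat j?

U = rng.normal(scale=0.1, size=(n_users, rank))
V = rng.normal(scale=0.1, size=(n_items, rank))
for _ in range(300):
    s = np.einsum("nk,nk->n", U[u], V[i] - V[j])    # predicted score differences
    g = 1 / (1 + np.exp(-s)) - y                    # d(logistic loss)/d(score)
    gU, gV = np.zeros_like(U), np.zeros_like(V)
    np.add.at(gU, u, g[:, None] * (V[i] - V[j]))
    np.add.at(gV, i, g[:, None] * U[u])
    np.add.at(gV, j, -g[:, None] * U[u])
    U -= 0.5 * gU / len(y)
    V -= 0.5 * gV / len(y)

S_hat = U @ V.T   # per-user rankings come from sorting each row of S_hat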
AI Insights
  • Leave‑one‑out analysis of nonconvex gradient descent iterates yields sharp Frobenius, spectral, and infinity‑norm error bounds.
  • The regularization parameter scales as λ = C_λ √(d̄/p̄), ensuring entry‑wise ℓ∞ control across heterogeneous users.
  • Iterations are capped at t₀ = O(d̄²), guaranteeing convergence with probability 1−O(d̄⁻¹⁰).
  • Singular values of the ground‑truth matrix satisfy σ₁(F⋆) = p σ⋆max/2 and σ_R(F⋆) = p σ⋆min/2, anchoring the low‑rank structure.
  • Leave‑one‑out subproblems f^(ℓ)(X,Y) are defined separately for ℓ∈[1,d₁] and ℓ∈[d₁+1, d̄], enabling decoupled analysis.
  • The gradient descent iterates F_t^(ℓ) are constructed via (H.3), preserving the low‑rank manifold throughout optimization.
  • These techniques are broadly applicable to any binary choice model with a smooth link, beyond the BTL framework.
September 02, 2025
Deep Learning
HSE University, Yandex
Abstract
Recent advancements in tabular deep learning have demonstrated exceptional practical performance, yet the field often lacks a clear understanding of why these techniques actually succeed. To address this gap, our paper highlights the importance of the concept of data uncertainty for explaining the effectiveness of recent tabular DL methods. In particular, we reveal that the success of many beneficial design choices in tabular DL, such as numerical feature embeddings, retrieval-augmented models, and advanced ensembling strategies, can be largely attributed to their implicit mechanisms for managing high data uncertainty. By dissecting these mechanisms, we provide a unifying understanding of the recent performance improvements. Furthermore, the insights derived from this data-uncertainty perspective directly allowed us to develop more effective numerical feature embeddings as an immediate practical outcome of our analysis. Overall, our work paves the way toward a foundational understanding of the benefits introduced by modern tabular methods, yields concrete advancements of existing techniques, and outlines future research directions for tabular DL.
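For context, here is a minimal sketch of one widely used numerical feature embedding, piecewise-linear encoding over quantile bins (after Gorishniy et al.); the improved embedding this paper derives from its uncertainty analysis is not reproduced here.

# Hedged sketch: piecewise-linear encoding of a scalar feature.
import numpy as np

def piecewise_linear_encode(x, bin_edges):
    """Bins fully below x saturate at 1, the active bin gets a fraction in (0, 1),
    higher bins stay 0, so nearby values get nearby embeddings."""
    x = np.asarray(x, dtype=np.float64)[:, None]              # (N, 1)
    lo, hi = bin_edges[:-1][None, :], bin_edges[1:][None, :]  # (1, n_bins)
    return np.clip((x - lo) / (hi - lo), 0.0, 1.0)            # (N, n_bins)

values = np.random.default_rng(0).normal(size=1000)
edges = np.quantile(values, np.linspace(0, 1, 9))             # 8 quantile bins
print(piecewise_linear_encode(values, edges)[0].round(2))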
AI Insights
  • Swapping Bayesian, MC‑Dropout, or ensemble uncertainty estimators leaves the MSE trend unchanged across datasets.
  • Figures show the performance gap between baseline and advanced tabular models is invariant to the uncertainty technique.
  • This invariance confirms conclusions are not artifacts of a specific uncertainty model.
  • Authors assume uncertainty estimators are accurate, which may fail in low‑sample or noisy regimes.
  • Data quality and sampling bias were not modeled, leaving room for future robust preprocessing work.
  • Recommended resources include “Bayesian Methods for Hackers” and a TensorFlow uncertainty tutorial.
  • Robustness of tabular DL hinges on design choices and fidelity of uncertainty estimates, inspiring hybrid architectures.
September 04, 2025
OpenReview benefits the
Abstract
OpenReview benefits the peer-review system by promoting transparency, openness, and collaboration. By making reviews, comments, and author responses publicly accessible, the platform encourages constructive feedback, reduces bias, and allows the research community to engage directly in the review process. This level of openness fosters higher-quality reviews, greater accountability, and continuous improvement in scholarly communication. In the statistics community, such a transparent and open review system has not traditionally existed. This lack of transparency has contributed to significant variation in the quality of published papers, even in leading journals, with some containing substantial errors in both proofs and numerical analyses. To illustrate this issue, this note examines several results from Wang, Zhou and Lin (2025) [arXiv:2309.12872; https://doi.org/10.1080/01621459.2024.2412364] and highlights potential errors in their proofs, some of which are strikingly obvious. This raises a critical question: how important are mathematical proofs in statistical journals, and how should they be rigorously verified? Addressing this question is essential not only for maintaining academic rigor but also for fostering the right attitudes toward scholarship and quality assurance in the field. A plausible approach would be for arXiv to provide an anonymous discussion section, allowing readers, whether anonymous or not, to post comments, while also giving authors the opportunity to respond.
AI Insights
  • Theorems 1 and 2 and Proposition 1 in Wang et al. (2025) contain algebraic errors that undermine convergence claims.
  • A chain‑rule misuse in Proposition 1’s gradient derivation exposes a common pitfall in high‑dimensional M‑estimation.
  • Minor proof mistakes can distort simulations, stressing theory‑code cross‑validation.
  • An anonymous arXiv discussion could serve as a live proof‑audit platform before acceptance.
  • Casella & Berger’s text remains essential for mastering probabilistic foundations that safeguard proofs.
  • Feng et al.’s score‑matching offers a robust alternative to conventional loss functions, aligning with optimality.
  • JASA’s reproducibility editorial echoes the push for transparent peer review.
September 03, 2025
Information Retrieval
Abstract
In this paper, we introduce Technical-Embeddings, a novel framework designed to optimize semantic retrieval in technical documentation, with applications in both hardware and software development. Our approach addresses the challenges of understanding and retrieving complex technical content by leveraging the capabilities of Large Language Models (LLMs). First, we enhance user queries by generating expanded representations that better capture user intent and improve dataset diversity, thereby enriching the fine-tuning process for embedding models. Second, we apply summary extraction techniques to encode essential contextual information, refining the representation of technical documents. To further enhance retrieval performance, we fine-tune a bi-encoder BERT model using soft prompting, incorporating separate learning parameters for queries and document context to capture fine-grained semantic nuances. We evaluate our approach on two public datasets, RAG-EDA and Rust-Docs-QA, demonstrating that Technical-Embeddings significantly outperforms baseline models in both precision and recall. Our findings highlight the effectiveness of integrating query expansion and contextual summarization to enhance information access and comprehension in technical domains. This work advances the state of Retrieval-Augmented Generation (RAG) systems, offering new avenues for efficient and accurate technical document retrieval in engineering and product development workflows.
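A minimal sketch of soft prompting in a bi-encoder as the abstract describes it: separate learnable prompt vectors for queries and documents, prepended to the token embeddings of a frozen BERT. Prompt length and mean pooling are assumptions, not the paper's exact configuration.

# Hedged sketch: bi-encoder with separate query/document soft prompts.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SoftPromptBiEncoder(nn.Module):
    def __init__(self, name="bert-base-uncased", prompt_len=8):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        for p in self.bert.parameters():
            p.requires_grad = False                    # train only the prompts
        d = self.bert.config.hidden_size
        self.query_prompt = nn.Parameter(torch.randn(prompt_len, d) * 0.02)
        self.doc_prompt = nn.Parameter(torch.randn(prompt_len, d) * 0.02)

    def encode(self, input_ids, attention_mask, is_query):
        prompt = self.query_prompt if is_query else self.doc_prompt
        tok_emb = self.bert.embeddings.word_embeddings(input_ids)
        batch = input_ids.size(0)
        x = torch.cat([prompt.expand(batch, -1, -1), tok_emb], dim=1)
        mask = torch.cat(
            [torch.ones(batch, prompt.size(0), dtype=attention_mask.dtype,
                        device=attention_mask.device), attention_mask], dim=1)
        out = self.bert(inputs_embeds=x, attention_mask=mask).last_hidden_state
        return (out * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = SoftPromptBiEncoder()
q = tok("how do I configure the timer peripheral", return_tensors="pt")
print(enc.encode(q["input_ids"], q["attention_mask"], is_query=True).shape)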
September 04, 2025
Pinterest
Abstract
Relevance evaluation plays a crucial role in personalized search systems to ensure that search results align with a user's queries and intent. While human annotation is the traditional method for relevance evaluation, its high cost and long turnaround time limit its scalability. In this work, we present our approach at Pinterest Search to automate relevance evaluation for online experiments using fine-tuned LLMs. We rigorously validate the alignment between LLM-generated judgments and human annotations, demonstrating that LLMs can provide reliable relevance measurement for experiments while greatly improving the evaluation efficiency. Leveraging LLM-based labeling further unlocks the opportunities to expand the query set, optimize sampling design, and efficiently assess a wider range of search experiences at scale. This approach leads to higher-quality relevance metrics and significantly reduces the Minimum Detectable Effect (MDE) in online experiment measurements.
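The insights below mention post-stratification as the bias correction over LLM labels. Here is a toy sketch of that idea: per-stratum mean relevance is reweighted by each stratum's known traffic share rather than its share of the sampled queries. Strata names and shares are made up for illustration.

# Hedged sketch: post-stratified relevance metric from LLM labels.
from collections import defaultdict

# (stratum, llm_relevance_score in [0, 1]) for sampled query-result pairs
labels = [("head", 0.9), ("head", 0.8), ("torso", 0.7), ("tail", 0.4), ("tail", 0.6)]
traffic_share = {"head": 0.5, "torso": 0.3, "tail": 0.2}  # known from query logs

by_stratum = defaultdict(list)
for stratum, score in labels:
    by_stratum[stratum].append(score)

metric = sum(traffic_share[s] * sum(v) / len(v) for s, v in by_stratum.items())
print(f"post-stratified relevance: {metric:.3f}")  # 0.735 vs naive mean 0.680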
AI Insights
  • Post‑stratification corrects bias in LLM relevance scores, a nuance absent from the abstract.
  • Fine‑tuned RankT5 outperforms vanilla LLMs on Pinterest queries, advancing ranking‑loss research.
  • Expanding the query pool gives a 30% sampling‑efficiency boost, widening experiment coverage.
  • Online experiments validate LLM judgments, linking offline metrics to real‑world impact.
  • Relevance Judgment: scoring result relevance to user intent, formalized with a probabilistic model.
  • Large Language Models: AI that generates language from context, here used for relevance scoring.
  • Recommended reading: “Sampling” by Thompson and Netflix’s sensitivity‑improvement case study.
September 03, 2025