Papers from 13 to 17 October, 2025

Here are your personalized paper recommendations, sorted by relevance.
Ruhr University Bochum, U
Abstract
The advent of LLMs has given rise to a new type of web search: Generative search, where LLMs retrieve web pages related to a query and generate a single, coherent text as a response. This output modality stands in stark contrast to traditional web search, where results are returned as a ranked list of independent web pages. In this paper, we ask: Along what dimensions do generative search outputs differ from traditional web search? We compare Google, a traditional web search engine, with four generative search engines from two providers (Google and OpenAI) across queries from four domains. Our analysis reveals intriguing differences. Most generative search engines cover a wider range of sources compared to web search. Generative search engines vary in the degree to which they rely on internal knowledge contained within the model parameters versus external knowledge retrieved from the web. Generative search engines surface varying sets of concepts, creating new opportunities for enhancing search diversity and serendipity. Our results also highlight the need for revisiting evaluation criteria for web search in the age of Generative AI.
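A toy illustration of how the source-coverage comparison in this abstract could be quantified: count the unique sources each engine cites and measure pairwise overlap. The engine names and source lists below are invented placeholders, not the paper's data.

```python
# Hypothetical sketch: quantifying source diversity across search engines.
# Engine names and cited-source lists are illustrative, not the paper's data.
from itertools import combinations

results = {
    "web_search":   ["nytimes.com", "wikipedia.org", "cdc.gov"],
    "generative_a": ["wikipedia.org", "cdc.gov", "nih.gov", "blog.example.org"],
    "generative_b": ["wikipedia.org", "who.int", "reddit.com"],
}

def jaccard(a, b):
    """Overlap of two engines' cited sources (1.0 = identical sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

for (n1, s1), (n2, s2) in combinations(results.items(), 2):
    print(f"{n1} vs {n2}: unique sources = {len(set(s1) | set(s2))}, "
          f"Jaccard overlap = {jaccard(s1, s2):.2f}")
```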
Nanjing University of Aon
Abstract
Query-service relevance prediction in e-commerce search systems faces strict latency requirements that prevent the direct application of Large Language Models (LLMs). To bridge this gap, we propose a two-stage reasoning distillation framework to transfer reasoning capabilities from a powerful teacher LLM to a lightweight, deployment-friendly student model. In the first stage, we address the limitations of general-purpose LLMs by constructing a domain-adapted teacher model. This is achieved through a three-step process: domain-adaptive pre-training to inject platform knowledge, supervised fine-tuning to elicit reasoning skills, and preference optimization with a multi-dimensional reward model to ensure the generation of reliable and preference-aligned reasoning paths. This teacher can then automatically annotate massive volumes of query-service pairs from search logs with both relevance labels and reasoning chains. In the second stage, to address the challenges of architectural heterogeneity in standard distillation, we introduce Contrastive Reasoning Self-Distillation (CRSD). By modeling the behavior of the same student model under "standard" and "reasoning-augmented" inputs as a teacher-student relationship, CRSD enables the lightweight model to internalize the teacher's complex decision-making mechanisms without needing the explicit reasoning path at inference. Offline evaluations and online A/B testing in the Meituan search advertising system demonstrate that our framework achieves significant improvements across multiple metrics, validating its effectiveness and practical value.
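A minimal sketch of the CRSD idea as the abstract describes it, assuming a classifier-style interface: the same student is run on a plain query-service input and on a reasoning-augmented input, and the plain branch is distilled toward the reasoning branch. The temperature, loss weighting, and tensor shapes are assumptions, not the paper's exact recipe.

```python
# Minimal sketch of Contrastive Reasoning Self-Distillation (CRSD) as the
# abstract describes it: the reasoning-augmented branch of the same student
# acts as the teacher for the plain branch. Interface and hyperparameters
# are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def crsd_loss(student, plain_ids, reasoning_ids, labels, alpha=0.5, tau=2.0):
    logits_plain = student(plain_ids)            # (batch, num_classes)
    with torch.no_grad():                        # reasoning branch acts as teacher
        logits_teach = student(reasoning_ids)
    ce = F.cross_entropy(logits_plain, labels)   # standard relevance supervision
    kd = F.kl_div(                               # distill the soft distribution
        F.log_softmax(logits_plain / tau, dim=-1),
        F.softmax(logits_teach / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return (1 - alpha) * ce + alpha * kd
```

At inference only the plain branch is needed, which matches the abstract's point that the explicit reasoning path can be dropped at serving time.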
Personalization
MBZUAI, ByteDance, National
Abstract
Large language models (LLMs) have grown more powerful in language generation, producing fluent text and even imitating personal style. Yet, this ability also heightens the risk of identity impersonation. To the best of our knowledge, no prior work has examined personalized machine-generated text (MGT) detection. In this paper, we introduce the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations. Our experimental results demonstrate large performance gaps across detectors in personalized settings: some state-of-the-art models suffer significant drops. We attribute this limitation to the feature-inversion trap, where features that are discriminative in general domains become inverted and misleading when applied to personalized text. Based on this finding, we propose a simple and reliable way to predict detector performance changes in personalized settings: identifying latent directions corresponding to inverted features and constructing probe datasets that differ primarily along these features to evaluate detector dependence. Our experiments show that this method can accurately predict both the direction and the magnitude of post-transfer changes, showing 85% correlation with the actual performance gaps. We hope that this work will encourage further research on personalized text detection.
AI Insights
  • The benchmark draws from classic Jane Austen and Bernard Shaw excerpts, adding a nostalgic twist to AI research.
  • Long, clause‑laden sentences test detectors’ parsing of complex syntax, a nuance absent from the abstract.
  • The literature review highlights 19th‑century social critique, focusing on marriage and gender roles.
  • Revisiting Pride and Prejudice and Mrs. Warren’s Profession helps grasp the stylistic cues detectors face.
  • A weakness: excerpts may lack broader context, potentially biasing detector evaluation.
  • Inverted features flip the meaning of formal-language cues, turning subtle stylistic signals into misleading ones.
  • Probe datasets along latent feature directions show how tiny stylistic shifts can swing detection accuracy.
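A hedged sketch of the probing idea behind those insights: fit a simple linear direction separating human from machine text in a general domain, then check whether detector scores correlate with that direction in the opposite way on personalized text. The difference-of-means probe and random features are stand-ins for the paper's method.

```python
# Toy rendering of feature-inversion probing. Feature extraction and the
# detector are stand-ins; only the overall logic follows the abstract.
import numpy as np

def latent_direction(X_human, X_machine):
    """Difference-of-means direction in feature space (a simple linear probe)."""
    d = X_machine.mean(axis=0) - X_human.mean(axis=0)
    return d / np.linalg.norm(d)

def inversion_score(direction, feats, detector_scores):
    """Correlation between projection on the direction and detector output.
    A sign flip between domains signals an inverted feature."""
    proj = feats @ direction
    return np.corrcoef(proj, detector_scores)[0, 1]

rng = np.random.default_rng(0)
X_h, X_m = rng.normal(0, 1, (100, 32)), rng.normal(0.5, 1, (100, 32))
d = latent_direction(X_h, X_m)
feats = np.vstack([X_h, X_m])
det = feats @ rng.normal(0, 1, 32)   # stand-in detector scores
print(inversion_score(d, feats, det))
# Comparing this score's sign on general vs personalized data predicts the
# direction of the performance change after transfer.
```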
The University of Hong Kong
Abstract
Training student models on synthetic data generated by strong teacher models is a promising way to distill the capabilities of teachers. However, recent studies show that stronger models are not always optimal teachers, revealing a mismatch between teacher outputs and student learnability. To address this issue, we propose PerSyn (Personalized data Synthesis), a novel synthesis strategy that operates under a new "Route then Generate" paradigm to create data tailored to each student model, enabling it to learn more effectively. Specifically, PerSyn first assigns each prompt to its optimal teacher via a query-level router that jointly considers student learnability and teacher response quality. Each teacher then synthesizes data only for its assigned prompts, making the process more efficient than the conventional "Generate then Select" paradigm, where all teachers must generate parallel responses for the entire prompt set before constructing the final dataset. Extensive experiments across different model families and scales demonstrate that PerSyn consistently achieves superior or comparable performance to all baselines in instruct tuning and math reasoning settings. Further analysis verifies the effectiveness of PerSyn and offers extra insights to propel future research.
AI Insights
  • PerSyn’s gains rise with reward‑model strength; Skywork‑Reward‑V2‑Llama‑3.1‑8B beats the 3B version.
  • Even with a weak reward model, PerSyn outperforms the strong CAR baseline.
  • Baselines Strong, Mix, Family‑Strong, and CAR trail PerSyn in instruction‑tuning and math reasoning.
  • Teacher‑assignment tables show Strong and Family‑Strong favor large models; CAR picks a single average teacher.
  • The router jointly optimizes student learnability and teacher response quality—an innovation beyond the abstract.
  • Suggested papers: Letzelter et al. (2025), Zhang et al. (2025), Ponkshe et al. (2025), Li et al. (2025b), Xu et al. (2025b); Magpie‑Zoo (Xu et al., 2025b) is a key dataset.
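The "Route then Generate" step reduces to a simple assignment rule; here is one possible rendering, where the learnability and quality scorers and their multiplicative combination are assumptions rather than PerSyn's actual router.

```python
# Illustrative sketch of query-level routing: each prompt goes to the single
# teacher whose responses score best on a combination of student learnability
# and response quality. The scoring functions are hypothetical stand-ins.
def route_prompts(prompts, teachers, learnability, quality):
    """learnability(prompt, teacher): how easily the student fits this
    teacher's responses on this prompt; quality(prompt, teacher): a
    reward-model score of the teacher's response."""
    assignment = {}
    for p in prompts:
        assignment[p] = max(
            teachers,
            key=lambda t: learnability(p, t) * quality(p, t),
        )
    return assignment  # each teacher then generates only for its own prompts
```

This is also where the efficiency claim comes from: each prompt is generated once, by one teacher, instead of once per teacher as in "Generate then Select".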
Deep Learning
Abstract
Automated rock classification from mineral composition presents a significant challenge in geological applications, with critical implications for material recycling, resource management, and industrial processing. While existing methods using one-dimensional convolutional neural networks (1D-CNNs) excel at mineral identification through Raman spectroscopy, the crucial step of determining rock types from mineral assemblages remains unsolved, particularly because the same minerals can form different rock types depending on their proportions and formation conditions. This study presents a novel knowledge-enhanced deep learning approach that integrates geological domain expertise with spectral analysis. Five machine learning methods were evaluated, of which the 1D-CNN and its uncertainty-aware variant demonstrated excellent mineral classification performance (98.37 ± 0.006% and 97.75 ± 0.010%, respectively). The integrated system's evaluation on rock samples revealed variable performance across lithologies, with optimal results for limestone classification but reduced accuracy for rocks sharing similar mineral assemblages. These findings not only expose critical challenges in automated geological classification systems but also provide a methodological framework for advancing material characterization and sorting technologies.
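For readers unfamiliar with the 1D-CNN backbone used for the mineral-identification stage, here is a minimal PyTorch example; the layer sizes, class count, and 1000-point spectrum length are illustrative choices, not the paper's configuration.

```python
# Minimal 1D-CNN for spectrum classification, in the spirit of the mineral
# identification stage above. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SpectrumCNN(nn.Module):
    def __init__(self, n_classes=10, spectrum_len=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(4),
        )
        self.head = nn.Linear(32 * (spectrum_len // 16), n_classes)

    def forward(self, x):               # x: (batch, 1, spectrum_len)
        h = self.features(x)
        return self.head(h.flatten(1))  # per-mineral class logits

logits = SpectrumCNN()(torch.randn(4, 1, 1000))  # smoke test: shape (4, 10)
```

The rock-type step the paper tackles would sit on top of this, mapping predicted mineral assemblages (plus domain knowledge) to lithologies.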
University of Bremen, and
Abstract
This article addresses the challenge of learning effective regularizers for linear inverse problems. We analyze and compare several types of learned variational regularization against the theoretical benchmark of the optimal affine reconstruction, i.e., the best possible affine-linear map for minimizing the mean squared error. It is known that this optimal reconstruction can be achieved using Tikhonov regularization, but this requires precise knowledge of the noise covariance to properly weight the data fidelity term. However, in many practical applications, noise statistics are unknown. We therefore investigate the performance of regularization methods learned without access to this noise information, focusing on Tikhonov, Lavrentiev, and quadratic regularization. Our theoretical analysis and numerical experiments demonstrate that for non-white noise, a performance gap emerges between these methods and the optimal affine reconstruction. Furthermore, we show that these different types of regularization yield distinct results, highlighting that the choice of regularizer structure is critical when the noise model is not explicitly learned. Our findings underscore the significant value of accurately modeling or co-learning noise statistics in data-driven regularization.
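A small numerical sketch of the gap discussed here: for y = Ax + n with a Gaussian prior and non-white noise, the optimal affine (MMSE) reconstruction x̂ = Σx Aᵀ(A Σx Aᵀ + Σn)⁻¹ y uses the noise covariance, while a generic Tikhonov solve ignores it. The problem sizes and covariances below are toy choices, not the paper's experiments.

```python
# Toy comparison: covariance-aware MMSE reconstruction vs plain Tikhonov.
import numpy as np

rng = np.random.default_rng(1)
m, n = 40, 20
A = rng.normal(size=(m, n))
Sigma_x = np.eye(n)                          # prior covariance of x
L = rng.normal(size=(m, m)) * 0.1
Sigma_n = L @ L.T + 0.01 * np.eye(m)         # non-white noise covariance

x = rng.multivariate_normal(np.zeros(n), Sigma_x)
y = A @ x + rng.multivariate_normal(np.zeros(m), Sigma_n)

# Optimal affine (MMSE) estimator: equivalent to Tikhonov with the data term
# weighted by the inverse noise covariance.
G = Sigma_x @ A.T @ np.linalg.inv(A @ Sigma_x @ A.T + Sigma_n)
x_opt = G @ y

# Plain Tikhonov with an unweighted data term (noise statistics ignored).
lam = 0.1
x_tik = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

print("MMSE error:", np.linalg.norm(x_opt - x))
print("unweighted Tikhonov error:", np.linalg.norm(x_tik - x))
```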
Information Retrieval
UT Austin, UCLA, Google
Abstract
Modern IR systems are increasingly tasked with answering complex, multi-faceted queries that require deep reasoning rather than simple keyword or semantic matching. While LLM-based IR has shown great promise, the prevailing retrieve-then-rerank paradigm inherits the limitations of embedding-based retrieval; parametric generative approaches are difficult to update with new information; and long-context methods that place the entire corpus in context are computationally infeasible for large document collections. To address these challenges, we introduce LATTICE, a hierarchical retrieval framework that enables an LLM to reason over and navigate large corpora with logarithmic search complexity by imposing a semantic tree structure on the corpus. Our approach consists of two stages: (1) an offline phase that organizes the corpus into a semantic hierarchy via either a bottom-up agglomerative strategy or a top-down divisive strategy using multi-level summaries and (2) an online traversal phase where a search LLM navigates this tree. A central challenge in such LLM-guided search is that the model's relevance judgments are noisy, context-dependent, and unaware of the hierarchy, making cross-branch and cross-level comparisons difficult. To overcome this, we propose a traversal algorithm that estimates calibrated latent relevance scores from local LLM outputs and aggregates them into a global path relevance metric. Our training-free framework achieves state-of-the-art zero-shot performance on the reasoning-intensive BRIGHT benchmark, demonstrating up to 9% improvement in Recall@100 and 5% in nDCG@10 over the next best zero-shot baseline. Furthermore, compared to the fine-tuned SOTA method DIVER-v2, LATTICE attains comparable results on BRIGHT subsets that use a static corpus for evaluation.
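A rough sketch of the online traversal stage as the abstract presents it: best-first search over the semantic tree, with local LLM relevance judgments aggregated into a path-level score. The Node structure, running-mean aggregation, and llm_score callback are assumptions standing in for LATTICE's calibrated latent relevance estimates.

```python
# Best-first traversal of a semantic tree under an LLM scoring budget.
import heapq
from itertools import count
from dataclasses import dataclass, field

@dataclass
class Node:
    summary: str                      # multi-level summary used for scoring
    children: list = field(default_factory=list)

def traverse(root, query, llm_score, budget=50, k=10):
    tie = count()                     # tiebreaker so the heap never compares Nodes
    frontier = [(-llm_score(query, root.summary), next(tie), root, [])]
    leaves = []
    while frontier and budget > 0:
        neg_rel, _, node, path_scores = heapq.heappop(frontier)
        budget -= 1
        if not node.children:         # leaf = document
            leaves.append((-neg_rel, node))
            continue
        for child in node.children:
            s = llm_score(query, child.summary)   # noisy local judgment
            scores = path_scores + [s]
            path_rel = sum(scores) / len(scores)  # aggregate along the path
            heapq.heappush(frontier, (-path_rel, next(tie), child, scores))
    leaves.sort(key=lambda item: -item[0])
    return [doc for _, doc in leaves[:k]]
```

The logarithmic search complexity the abstract claims comes from descending a balanced hierarchy rather than scoring the whole corpus.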
Shanghai Jiao Tong University
Abstract
Accurately modeling query-item relevance drives e-commerce ranking, yet long-tail, knowledge-heavy, and fast-evolving queries exceed parametric LLM coverage. External context (reviews, attribute encyclopedias, UGC) can help but is noisy, and single-pass latency and cost forbid any clean-then-summarize step. The model must, per query, judge relevance and decide whether to use, partially use, or ignore the context. DyKnow-RAG is a dynamic noisy-RAG framework built on Group Relative Policy Optimization. It trains two rollout groups (no external context vs a single retrieved chunk) and applies posterior-driven inter-group advantage scaling that adaptively reweights their contributions by the per-query correctness gap. This teaches when to trust retrieval versus fall back to parametric knowledge, without process labels, value networks, or extra inference passes, preserving single-pass, single-chunk deployment under production latency. Training combines: (1) supervised initialization with a structured rationale that explicitly records the context-usage decision; (2) an RL pool prioritized by SFT uncertainty to focus where context choice is most consequential; and (3) an optional lightweight DPO warm start to stabilize with-context calibration. Under a unified retrieval/index and fixed latency budget, DyKnow-RAG outperforms SFT, DPO, and vanilla GRPO in offline tests, and delivers consistent lifts on GSB, Query Goodrate, and Item Goodrate in Taobao A/B testing. It is deployed in Taobao's production relevance system, serving live traffic. To our knowledge, it is among the first single-pass RAG solutions for e-commerce relevance, turning noisy external signals into reliable gains without added online complexity.
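One way the inter-group advantage scaling could look in code, under our reading of the abstract: normalize advantages within each rollout group (GRPO-style), then reweight the groups by the per-query correctness gap. The exact scaling rule is an assumption.

```python
# Sketch of group-relative advantages with posterior-driven inter-group
# reweighting: the no-context and with-context rollout groups are normalized
# separately, then scaled by the per-query correctness gap between them.
import numpy as np

def grouped_advantages(r_no_ctx, r_ctx, eps=1e-6):
    """r_no_ctx, r_ctx: per-rollout rewards (1 = correct relevance label)."""
    gap = r_ctx.mean() - r_no_ctx.mean()     # posterior correctness gap
    adv = []
    for rewards, sign in ((r_no_ctx, -1.0), (r_ctx, +1.0)):
        a = (rewards - rewards.mean()) / (rewards.std() + eps)  # GRPO baseline
        a *= 1.0 + sign * gap                # upweight the more reliable group
        adv.append(a)
    return adv

adv_no_ctx, adv_ctx = grouped_advantages(np.array([1., 0., 0., 1.]),
                                         np.array([1., 1., 1., 0.]))
```

When retrieval helps on a query (positive gap), the with-context group's advantages dominate the update; when it hurts, the model is pushed back toward its parametric knowledge, with no value network or extra inference pass.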
Ranking
New York University, Ston
Abstract
A growing class of applications demands fair ordering/sequencing of events, which ensures that events generated earlier by one client are processed before later events from other clients. However, achieving such sequencing is fundamentally challenging due to the inherent limitations of clock synchronization. We advocate for an approach that embraces, rather than eliminates, clock variability. Instead of attempting to remove error from a timestamp, Tommy, our proposed system, leverages a statistical model to compare two noisy timestamps probabilistically by learning per-clock offset distributions. Our preliminary statistical model computes the probability that one event precedes another w.r.t. the wall-clock time without access to the wall-clock. This serves as a foundation for a new relation: likely-happened-before, denoted →p, where p represents the probability that one event happened before another. The →p relation provides a basis for ordering multiple events which are otherwise considered concurrent by the typical happened-before (→) relation. We highlight various related challenges, including the intransitivity of the →p relation as opposed to the transitive → relation. We also outline several research directions: online fair sequencing, stochastically fair total ordering, host-level support for fairness, and more.
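If per-clock offsets are modeled as Gaussian, the likely-happened-before probability has a closed form, which the toy function below computes. The Gaussian offset model and the numbers are illustrative assumptions, not Tommy's learned distributions.

```python
# P(event a truly preceded event b) given observed timestamps and learned
# per-clock offset statistics, assuming offset ~ Normal(mu, sig^2) per clock
# and ts = true_time + offset.
from math import erf, sqrt

def p_happened_before(ts_a, ts_b, mu_a, sig_a, mu_b, sig_b):
    mean = (ts_b - mu_b) - (ts_a - mu_a)   # expected true-time gap t_b - t_a
    std = sqrt(sig_a**2 + sig_b**2)
    return 0.5 * (1 + erf(mean / (std * sqrt(2))))

# Two near-simultaneous events on clocks with different learned offsets:
print(p_happened_before(100.0, 100.2, mu_a=0.5, sig_a=0.3, mu_b=0.0, sig_b=0.3))
```

This also makes the intransitivity concern concrete: p(a →p b) > 0.5 and p(b →p c) > 0.5 do not by themselves force p(a →p c) > 0.5.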
Abstract
In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts a "yes" (resp. "no") token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking the universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: weight, which controls the magnitude of those updates, and direction, which guides the model updates, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark. We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.
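The two objectives being compared are easy to write down side by side; the sketch below uses a simplified score and token-logit interface rather than the paper's actual training setup.

```python
# Side-by-side toy versions of the two reranker training objectives.
import torch
import torch.nn.functional as F

def contrastive_loss(pos_score, neg_scores, tau=0.05):
    """InfoNCE-style CL: raise the positive pair's score above in-batch negatives."""
    logits = torch.cat([pos_score.view(1), neg_scores]) / tau
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))

def sft_loss(yes_logit, no_logit, is_relevant):
    """SFT as binary token prediction: generate "yes" for relevant pairs, "no" otherwise."""
    logits = torch.stack([no_logit, yes_logit]).unsqueeze(0)
    return F.cross_entropy(logits, torch.tensor([int(is_relevant)]))

pos, negs = torch.tensor(2.1), torch.tensor([1.3, 0.2, -0.5])
print(contrastive_loss(pos, negs),
      sft_loss(torch.tensor(1.7), torch.tensor(-0.9), is_relevant=True))
```

In the paper's decomposition, the difference between these losses is analyzed in terms of how strongly each example is weighted in the gradient versus which scoring direction the update pushes.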