Hi!

Your personalized paper recommendations for 12 to 16 January, 2026.

MLPlatt: Simple Calibration Framework for Ranking Models

Allegro sp z oo

Rate paper: 👍 👎 ♥ Save

AI Insights

CTR: Click-Through Rate Ranker: a model that predicts the ranking order of items based on their relevance to a user's query The proposed method, MLPlatt, demonstrates superior performance compared to other strong baseline approaches. [3]
The paper does not provide a detailed comparison with other state-of-the-art methods. [3]
The method's performance may degrade when dealing with large-scale datasets or complex ranking tasks. [3]
The paper proposes a novel framework called MLPlatt for transforming uncalibrated ranker predictions into CTR probabilities while preserving the ranking order. [2]

Abstract
Ranking models are extensively used in e-commerce for relevance estimation. These models often suffer from poor interpretability and no scale calibration, particularly when trained with typical ranking loss functions. This paper addresses the problem of post-hoc calibration of ranking models. We introduce MLPlatt: a simple yet effective ranking model calibration method that preserves the item ordering and converts ranker outputs to interpretable click-through rate (CTR) probabilities usable in downstream tasks. The method is context-aware by design and achieves good calibration metrics globally, and within strata corresponding to different values of a selected categorical field (such as user country or device), which is often important from a business perspective of an E-commerce platform. We demonstrate the superiority of MLPlatt over existing approaches on two datasets, achieving an improvement of over 10\% in F-ECE (Field Expected Calibration Error) compared to other methods. Most importantly, we show that high-quality calibration can be achieved without compromising the ranking quality.

Why we are recommending this paper?
Due to your Interest in Ranking

This paper directly addresses ranking model calibration, a key area within personalization and search. Given your interest in ranking and personalization, this work offers a valuable approach to improving model accuracy and interpretability.

SPRInG: Continual LLM Personalization via Selective Parametric Adaptation and Retrieval-Interpolated Generation

Yonsei University

Rate paper: 👍 👎 ♥ Save

AI Insights

Replay buffer: A memory buffer in SPRING that stores a subset of the user's interactions, used to update the parametric adapter. [3]
Drift score: A measure of how much a user's interaction history has changed over time, used to select samples for the replay buffer. [3]
ROUGE scores: Metrics used to evaluate the quality of generated text, including ROUGE-1 and ROUGE-L. [3]
The paper presents SPRING, a framework for adapting large language models (LLMs) to individual users based on their interaction history. [2]
SPRING: A framework for adapting large language models to individual users based on their interaction history. [1]

Abstract
Personalizing Large Language Models typically relies on static retrieval or one-time adaptation, assuming user preferences remain invariant over time. However, real-world interactions are dynamic, where user interests continuously evolve, posing a challenge for models to adapt to preference drift without catastrophic forgetting. Standard continual learning approaches often struggle in this context, as they indiscriminately update on noisy interaction streams, failing to distinguish genuine preference shifts from transient contexts. To address this, we introduce SPRInG, a novel semi-parametric framework designed for effective continual personalization. During training, SPRInG employs drift-driven selective adaptation, which utilizes a likelihood-based scoring function to identify high-novelty interactions. This allows the model to selectively update the user-specific adapter on drift signals while preserving hard-to-learn residuals in a replay buffer. During inference, we apply strict relevance gating and fuse parametric knowledge with retrieved history via logit interpolation. Experiments on the long-form personalized generation benchmark demonstrate that SPRInG outperforms existing baselines, validating its robustness for real-world continual personalization.

Why we are recommending this paper?
Due to your Interest in Personalization

Recognizing the dynamic nature of user preferences, this research focuses on continual LLM personalization, aligning with your interest in personalization and search. The approach of adapting to preference drift is particularly relevant to your interests.

PersonaDual: Balancing Personalization and Objectivity via Adaptive Reasoning

Fudan University

Rate paper: 👍 👎 ♥ Save

$Paper visualization$

Rate image: 👍 👎

AI Insights

On objective tasks, personalized information may not always improve model performance and can even lead to factual errors or logical biases. [3]
On subjective personalized tasks, personalized information is crucial for improving model performance. [3]
The type of personalized information used can significantly impact model performance. [3]
Aligned personas tend to perform better than unaligned personas on both objective and subjective tasks. [3]
To maximize the benefits of personalized information, it is essential to carefully design and implement the personaDual framework, taking into account the specific requirements and constraints of each application domain. [3]
Personalized information can have a double-edged effect on the performance of large language models (LLMs). [2]

Abstract
As users increasingly expect LLMs to align with their preferences, personalized information becomes valuable. However, personalized information can be a double-edged sword: it can improve interaction but may compromise objectivity and factual correctness, especially when it is misaligned with the question. To alleviate this problem, we propose PersonaDual, a framework that supports both general-purpose objective reasoning and personalized reasoning in a single model, and adaptively switches modes based on context. PersonaDual is first trained with SFT to learn two reasoning patterns, and then further optimized via reinforcement learning with our proposed DualGRPO to improve mode selection. Experiments on objective and personalized benchmarks show that PersonaDual preserves the benefits of personalization while reducing interference, achieving near interference-free performance and better leveraging helpful personalized signals to improve objective problem-solving.

Why we are recommending this paper?
Due to your Interest in Personalization

This paper tackles the critical challenge of balancing personalization with objectivity, a core concern within information retrieval. The focus on adaptive reasoning provides a strong foundation for building more robust and trustworthy personalization systems.

LISP -- A Rich Interaction Dataset and Loggable Interactive Search Platform

University of DuisburgEssen

Rate paper: 👍 👎 ♥ Save

$Paper visualization$

Rate image: 👍 👎

AI Insights

It provides a loggable interactive search platform that can be used to study user behavior, develop new algorithms, and improve search engine performance. [3]
The dataset contains a large number of interactions from various sources, including web archives, podcast streaming platforms, and social sciences academic search engines. [3]
The dataset may be biased towards certain types of interactions or users. [3]
Confirmation bias: A phenomenon where people tend to seek out information that confirms their pre-existing beliefs and ignore contradictory evidence. [3]
Loggable interactive search platform: A system that can record user interactions with a search engine or website, providing valuable data for research and development. [2]
LISP rich interaction dataset loggable interactive search platform LISP: A rich interaction dataset and loggable interactive search platform. [1]

Abstract
We present a reusable dataset and accompanying infrastructure for studying human search behavior in Interactive Information Retrieval (IIR). The dataset combines detailed interaction logs from 61 participants (122 sessions) with user characteristics, including perceptual speed, topic-specific interest, search expertise, and demographic information. To facilitate reproducibility and reuse, we provide a fully documented study setup, a web-based perceptual speed test, and a framework for conducting similar user studies. Our work allows researchers to investigate individual and contextual factors affecting search behavior, and to develop or validate user simulators that account for such variability. We illustrate the datasets potential through an illustrative analysis and release all resources as open-access, supporting reproducible research and resource sharing in the IIR community.

Why we are recommending this paper?
Due to your Interest in Search

With your interest in search, this paper offers a valuable dataset and platform for studying human-computer interaction in information retrieval. Analyzing user behavior within this framework will be highly beneficial to your research.

DR-Arena: an Automated Evaluation Framework for Deep Research Agents

National University of Singapore

Rate paper: 👍 👎 ♥ Save

AI Insights

The document describes a system called DR-Arena, which is designed to evaluate the performance of search agents. [3]
The system generates complex research tasks and evaluates the responses from two search agents based on their accuracy, comprehensiveness, formatting, and helpfulness. [3]
DR-Arena is a system that tests search agents' abilities by giving them complex research tasks. [3]
The system evaluates the answers from two search agents based on how accurate, comprehensive, and helpful they are. [3]
DR-Arena The document does not provide information about how the system handles cases where both search agents fail to find the correct entity. [2]

Abstract
As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination. To address these, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges. Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents the state-of-the-art alignment with human preferences without any manual efforts, validating DR-Arena as a reliable alternative for costly human adjudication.

Why we are recommending this paper?
Due to your Interest in Deep Learning

Given your interest in deep learning and research agents, this paper provides a crucial evaluation framework for LLMs operating as DR agents. The focus on automated evaluation is directly relevant to assessing and improving these complex systems.

Asymptotic rank bounds: a numerical census

Clemson University

Rate paper: 👍 👎 ♥ Save

AI Insights

The paper discusses the problem of finding a strict improvement via the framework of Theorem 3 for non-defective cases. [3]
The authors conduct an exhaustive search and find no counterexamples in every non-defective case with r ≤ 10, suggesting a positive answer to the question. [3]
However, verifying this requires improvements in numerical techniques since the degree alone does not directly determine q in higher codimensions. [3]
Non-defective case: A case where the secant variety σr(V) is non-defective, meaning it has no singularities and its dimension equals the expected value. [3]
Theorem 3: A framework for finding a strict improvement via the degree of the secant variety σr(V). [3]
The authors' exhaustive search suggests that every non-defective case admits a strict improvement via the framework of Theorem 3. [3]
However, verifying this requires improvements in numerical techniques since the degree alone does not directly determine q in higher codimensions. [3]
Improved bound: A bound that is stricter than the geometric bounds provided by Theorem 3. [2]
The authors rely on numerical techniques, which may not be accurate or reliable in higher codimensions. [0]

Abstract
We systematically compute improved asymptotic rank bounds for tensors. Using numerical implicitization, we implement the geometric framework of Kaski and Michałek across all computationally feasible cases. By detecting the absence of low-degree vanishing polynomials on secant varieties, we obtain new asymptotic rank bounds that improve upon the generic border rank bounds. The results provide numerical data supporting Strassen's asymptotic rank conjecture and clarify the computational barriers posed by current numerical methods.

Why we are recommending this paper?
Due to your Interest in Ranking

In-Browser Agents for Search Assistance

University of Passau

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The authors propose a novel approach to modeling user behavior using a graph neural network (GNN), which allows for more accurate predictions and better personalization. [3]
Conversational AI: A type of artificial intelligence that enables computers to understand and respond to human language in a conversational manner. [3]
Graph Neural Network (GNN): A type of neural network designed to handle graph-structured data, which is particularly useful for modeling complex relationships between entities. [3]
The development of web-based conversational AI systems has the potential to revolutionize the way we interact with technology and access information. [3]
The paper cites several studies on conversational AI and user behavior modeling, including the use of GNNs for predicting user interactions. [3]
The system uses a combination of natural language processing (NLP) and machine learning (ML) techniques to understand user queries and provide relevant responses. [2]
The paper discusses the development of a web-based conversational AI system that can simulate user interactions and search behaviors. [1]

Abstract
A fundamental tension exists between the demand for sophisticated AI assistance in web search and the need for user data privacy. Current centralized models require users to transmit sensitive browsing data to external services, which limits user control. In this paper, we present a browser extension that provides a viable in-browser alternative. We introduce a hybrid architecture that functions entirely on the client side, combining two components: (1) an adaptive probabilistic model that learns a user's behavioral policy from direct feedback, and (2) a Small Language Model (SLM), running in the browser, which is grounded by the probabilistic model to generate context-aware suggestions. To evaluate this approach, we conducted a three-week longitudinal user study with 18 participants. Our results show that this privacy-preserving approach is highly effective at adapting to individual user behavior, leading to measurably improved search efficiency. This work demonstrates that sophisticated AI assistance is achievable without compromising user privacy or data control.

Why we are recommending this paper?
Due to your Interest in Search

Deep Exploration of Epoch-wise Double Descent in Noisy Data: Signal Separation, Large Activation, and Benign Overfitting

Niigata University

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The double descent phenomenon in machine learning refers to the observation that as the model size and training data increase, the generalization error of a model can first decrease and then increase again. [3]
This phenomenon has been observed in various contexts, including linear regression, neural networks, and graph convolutional networks. [3]
The double descent curve is characterized by three phases: underfitting, overfitting, and double descent. [3]
In the underfitting phase, the model is too simple to capture the underlying patterns in the data, leading to poor generalization performance. [3]
In the overfitting phase, the model is too complex and captures noise in the data, also leading to poor generalization performance. [3]
The double descent phase occurs when the model size increases beyond a certain point, causing the model to start capturing the underlying patterns in the data again, but with increased capacity for overfitting. [3]
Some of these explanations include the bias-variance trade-off, the effect of noise on fitting linear regression models, and the role of regularization in mitigating double descent. [3]
Researchers have also proposed various methods to mitigate or understand the double descent phenomenon, including optimal regularization, early stopping, and multi-scale feature learning dynamics. [3]
These methods aim to balance the capacity of the model with its ability to generalize well to new data. [3]
The study of double descent has significant implications for machine learning research and practice. [3]
It highlights the importance of understanding the trade-offs between model complexity and generalization performance, and provides insights into how to design models that can generalize well to new data. [3]
However, the existing literature provides valuable insights into this phenomenon and highlights the importance of continued investigation in this area. [3]
Double Descent: A phenomenon where as the model size and training data increase, the generalization error of a model can first decrease and then increase again. [3]
Underfitting: When a model is too simple to capture the underlying patterns in the data, leading to poor generalization performance. [3]
Overfitting: When a model is too complex and captures noise in the data, also leading to poor generalization performance. [3]
Double Descent Curve: A curve that characterizes the three phases of the double descent phenomenon: underfitting, overfitting, and double descent. [3]
The double descent phenomenon has been studied extensively in recent years, and various explanations have been proposed. [1]

Abstract
Deep double descent is one of the key phenomena underlying the generalization capability of deep learning models. In this study, epoch-wise double descent, which is delayed generalization following overfitting, was empirically investigated by focusing on the evolution of internal structures. Fully connected neural networks of three different sizes were trained on the CIFAR-10 dataset with 30% label noise. By decomposing the loss curves into signal contributions from clean and noisy training data, the epoch-wise evolutions of internal signals were analyzed separately. Three main findings were obtained from this analysis. First, the model achieved strong re-generalization on test data even after perfectly fitting noisy training data during the double descent phase, corresponding to a "benign overfitting" state. Second, noisy data were learned after clean data, and as learning progressed, their corresponding internal activations became increasingly separated in outer layers; this enabled the model to overfit only noisy data. Third, a single, very large activation emerged in the shallow layer across all models; this phenomenon is referred as "outliers," "massive activa-tions," and "super activations" in recent large language models and evolves with re-generalization. The magnitude of large activation correlated with input patterns but not with output patterns. These empirical findings directly link the recent key phenomena of "deep double descent," "benign overfitting," and "large activation", and support the proposal of a novel scenario for understanding deep double descent.

Why we are recommending this paper?
Due to your Interest in Deep Learning

PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark

Beijing University of Posts and Telecommunications

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The paper discusses a new benchmark called BEIR (Benchmarking Evaluation in Retrieval) for evaluating the performance of information retrieval models. [3]
The benchmark includes 15 datasets and covers various languages. [3]
BEIR: Benchmarking Evaluation in Retrieval LLMs: Large Language Models Position Bias: A phenomenon where the ranking of documents is influenced by their position in the result list, rather than their relevance to the query. [3]
The paper highlights the importance of evaluating information retrieval models using diverse and representative datasets. [3]
The paper does not provide any empirical evaluation or results from experiments conducted using the proposed methods. [3]
The authors also discuss the use of large language models (LLMs) to generate synthetic test collections, which can be used to evaluate the performance of retrieval systems without requiring human annotators. [2]

Abstract
While dense retrieval models have achieved remarkable success, rigorous evaluation of their sensitivity to the position of relevant information (i.e., position bias) remains largely unexplored. Existing benchmarks typically employ position-agnostic relevance labels, conflating the challenge of processing long contexts with the bias against specific evidence locations. To address this challenge, we introduce PosIR (Position-Aware Information Retrieval), a comprehensive benchmark designed to diagnose position bias in diverse retrieval scenarios. PosIR comprises 310 datasets spanning 10 languages and 31 domains, constructed through a rigorous pipeline that ties relevance to precise reference spans, enabling the strict disentanglement of document length from information position. Extensive experiments with 10 state-of-the-art embedding models reveal that: (1) Performance on PosIR in long-context settings correlates poorly with the MMTEB benchmark, exposing limitations in current short-text benchmarks; (2) Position bias is pervasive and intensifies with document length, with most models exhibiting primacy bias while certain models show unexpected recency bias; (3) Gradient-based saliency analysis further uncovers the distinct internal attention mechanisms driving these positional preferences. In summary, PosIR serves as a valuable diagnostic framework to foster the development of position-robust retrieval systems.

Why we are recommending this paper?
Due to your Interest in Information Retrieval

AgriLens: Semantic Retrieval in Agricultural Texts Using Topic Modeling and Language Models

The Chinese University of Hong Kong

Rate paper: 👍 👎 ♥ Save

AI Insights

Agricultural large language models (LLMs) have been increasingly used in various applications such as knowledge transfer and practical application. [2]

Abstract
As the volume of unstructured text continues to grow across domains, there is an urgent need for scalable methods that enable interpretable organization, summarization, and retrieval of information. This work presents a unified framework for interpretable topic modeling, zero-shot topic labeling, and topic-guided semantic retrieval over large agricultural text corpora. Leveraging BERTopic, we extract semantically coherent topics. Each topic is converted into a structured prompt, enabling a language model to generate meaningful topic labels and summaries in a zero-shot manner. Querying and document exploration are supported via dense embeddings and vector search, while a dedicated evaluation module assesses topical coherence and bias. This framework supports scalable and interpretable information access in specialized domains where labeled data is limited.

Why we are recommending this paper?
Due to your Interest in Information Retrieval

💬 Help Shape Our Pricing

We're exploring pricing options to make this project sustainable. Take 3 minutes to share what you'd be willing to pay (if anything). Your input guides our future investment.

Share Your Feedback

Help us improve your experience!

This project is on its early stages your feedback can be pivotal on the future of the project. Let us know what you think about this week's papers and suggestions!

Give Feedback