🎯 Top Personalized Recommendations
University of Illinois
Why we think this paper is great for you:
This paper directly addresses advancements in reranking algorithms for document retrieval, which is highly relevant to improving search quality. You will find its exploration of LLM-based approaches for reasoning-intensive queries particularly insightful.
Abstract
Reranking algorithms have made progress in improving document retrieval
quality by efficiently aggregating relevance judgments generated by large
language models (LLMs). However, identifying relevant documents for queries
that require in-depth reasoning remains a major challenge. Reasoning-intensive
queries often exhibit multifaceted information needs and nuanced
interpretations, rendering document relevance inherently context dependent. To
address this, we propose contextual relevance, which we define as the
probability that a document is relevant to a given query, marginalized over the
distribution of different reranking contexts it may appear in (i.e., the set of
candidate documents it is ranked alongside and the order in which the documents
are presented to a reranking model). While prior works have studied methods to
mitigate the positional bias LLMs exhibit by accounting for the ordering of
documents, we empirically find that the composition of these batches also
plays an important role in reranking performance. To efficiently estimate
contextual relevance, we propose TS-SetRank, a sampling-based,
uncertainty-aware reranking algorithm. Empirically, TS-SetRank improves nDCG@10
over retrieval and reranking baselines by 15-25% on BRIGHT and 6-21% on BEIR,
highlighting the importance of modeling relevance as context-dependent.
AI Summary
- The proposed "contextual relevance" framework models document relevance as a probability marginalized over diverse reranking contexts, challenging the traditional deterministic and context-independent assumptions. [3]
- TS-SetRank, a two-phase Bayesian reranking algorithm combining uniform and adaptive Thompson sampling, efficiently estimates contextual relevance and significantly improves nDCG@10 (15-25% on BRIGHT, 6-21% on BEIR) over baselines. [3]
- Deterministic reranking algorithms like Heapify underperform due to their reliance on potentially noisy pairwise comparisons, highlighting the need for methods that aggregate judgments across diverse contexts. [3]
- Uniform sampling, while effective in the long run by averaging judgments, exhibits diminishing returns and converges after approximately 300 inference calls, suggesting a practical limit to non-adaptive exploration. [3]
- TS-SetRank (Thompson Sampling for Setwise Reranking): A two-phase Bayesian reranking algorithm that first samples document batches uniformly to collect unbiased relevance feedback and then adaptively constructs batches using Thompson sampling to efficiently estimate contextual relevance (a code sketch follows this list). [3]
- Document relevance for reasoning-intensive queries is context-dependent, influenced by both the composition and ordering of documents within an LLM processing batch. [2]
- Empirical analysis reveals that contextual factors (primarily document order within a batch) account for a substantial portion (25-45%) of the variability in LLM-based relevance judgments, beyond intrinsic model stochasticity. [2]
- TS-SetRank demonstrates superior performance under smaller inference budgets compared to uniform sampling, indicating its effectiveness in adaptively allocating resources to promising candidates earlier. [2]
- Contextual Relevance: The probability that a document is judged relevant to a given query, marginalized over the distribution of different reranking contexts it may appear in (i.e., the set of candidate documents it is ranked alongside and the order in which the documents are presented to a reranking model). [2]
- Setwise Prompting Approach: An LLM-based reranking method where smaller subsets or batches of documents are presented to an LLM, which generates per-document binary relevance judgments, subsequently aggregated to form final rankings. [2]
- Formally, the contextual relevance of document $d_i$ for query $q$ is $\theta_{i,q} = \mathbb{E}_S[\Pr(d_i \text{ is judged relevant} \mid q, S)]$, where the expectation is taken over the distribution of batches $S$. [1]
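For intuition, here is a minimal Python sketch of the two-phase procedure described above. It assumes a black-box judge(query, batch) that returns one binary judgment per document (e.g., parsed from an LLM's setwise prompt); the priors, budgets, and batch size are illustrative choices, not the paper's implementation.

```python
import random

def ts_setrank(query, docs, judge, batch_size=5,
               uniform_calls=20, adaptive_calls=30, seed=0):
    """Illustrative two-phase Thompson sampling for setwise reranking.

    `judge(query, batch)` is an assumed black-box returning one binary
    relevance judgment per document in `batch`. Each document keeps a
    Beta posterior over its contextual relevance: the probability it is
    judged relevant, averaged over the batches (contexts) it appears in.
    """
    rng = random.Random(seed)
    alpha = {d: 1.0 for d in docs}  # Beta prior pseudo-counts (relevant)
    beta = {d: 1.0 for d in docs}   # Beta prior pseudo-counts (not relevant)

    def update(batch):
        for doc, is_rel in zip(batch, judge(query, batch)):
            if is_rel:
                alpha[doc] += 1
            else:
                beta[doc] += 1

    # Phase 1: uniform random batches give unbiased feedback across contexts.
    for _ in range(uniform_calls):
        update(rng.sample(docs, batch_size))

    # Phase 2: Thompson sampling focuses inference calls on promising docs.
    for _ in range(adaptive_calls):
        theta = {d: rng.betavariate(alpha[d], beta[d]) for d in docs}
        batch = sorted(docs, key=theta.get, reverse=True)[:batch_size]
        rng.shuffle(batch)  # randomize order to average out positional bias
        update(batch)

    # Rank by posterior mean estimate of contextual relevance.
    return sorted(docs, key=lambda d: alpha[d] / (alpha[d] + beta[d]),
                  reverse=True)
```

The uniform phase seeds unbiased estimates across many contexts; the adaptive phase then spends the remaining inference budget on the candidates whose posteriors look most promising, matching the budget behavior the summary bullets describe.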
Taras Shevchenko National University of Kyiv
Why we think this paper is great for you:
This work delves into critical evaluation metrics like Mean Average Precision, essential for assessing the quality of ranking algorithms in information retrieval and recommender systems. It offers valuable insights into the foundations of effective system assessment.
Abstract
Recommender systems and information retrieval platforms rely on ranking
algorithms to present the most relevant items to users, thereby improving
engagement and satisfaction. Assessing the quality of these rankings requires
reliable evaluation metrics. Among them, Mean Average Precision at cutoff k
(MAP@k) is widely used, as it accounts for both the relevance of items and
their positions in the list.
In this paper, we derive the expectation and variance of Average Precision at k
(AP@k), which serve as baselines for MAP@k. We cover two widely used evaluation
models: offline and online. The expectation establishes the baseline,
indicating the level of MAP@k that can be achieved by pure chance, while the
variance complements this baseline by quantifying the extent of random
fluctuations, enabling a more reliable interpretation of observed scores.
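A quick way to sanity-check such chance baselines is Monte Carlo simulation: shuffle the ranking, compute AP@k, and repeat. The sketch below covers the offline setting with binary relevance and assumes the min(k, R) normalization (conventions vary); it illustrates the baseline idea rather than reproducing the paper's closed-form derivations.

```python
import numpy as np

def ap_at_k(ranking, relevant, k):
    """AP@k for binary relevance, normalized by min(k, |relevant|)
    (one common offline convention; others divide by |relevant|)."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranking[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(k, len(relevant)) if relevant else 0.0

def chance_baseline(n_items, n_relevant, k, trials=20_000, seed=0):
    """Monte Carlo estimate of E[AP@k] and Var[AP@k] for a random ranking."""
    rng = np.random.default_rng(seed)
    relevant = set(range(n_relevant))
    items = np.arange(n_items)
    samples = np.empty(trials)
    for t in range(trials):
        rng.shuffle(items)
        samples[t] = ap_at_k(items, relevant, k)
    return samples.mean(), samples.var()

mean, var = chance_baseline(n_items=100, n_relevant=10, k=10)
print(f"chance-level AP@10 ~ {mean:.4f}, variance ~ {var:.6f}")
```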
TOBB University of Economics and Technology
Why we think this paper is great for you:
This framework for optimizing Retrieval-Augmented Generation methods, encompassing retrieval and ranking, offers a comprehensive approach to building robust information systems. You'll appreciate its focus on end-to-end architecture search for better performance.
Abstract
Retrieval-Augmented Generation (RAG) quality depends on many interacting
choices across retrieval, ranking, augmentation, prompting, and generation, so
optimizing modules in isolation is brittle. We introduce RAGSmith, a modular
framework that treats RAG design as an end-to-end architecture search over nine
technique families and 46,080 feasible pipeline configurations. A genetic
search optimizes a scalar objective that jointly aggregates retrieval metrics
(recall@k, mAP, nDCG, MRR) and generation metrics (LLM-Judge and semantic
similarity). We evaluate on six Wikipedia-derived domains (Mathematics, Law,
Finance, Medicine, Defense Industry, Computer Science), each with 100 questions
spanning factual, interpretation, and long-answer types. RAGSmith finds
configurations that consistently outperform the naive RAG baseline by +3.8% on
average (range +1.2% to +6.9% across domains), with gains of up to +12.5% in
retrieval and +7.5% in generation. The search typically explores about 0.2% of
the space (around 100 candidates) and discovers a robust backbone --
vector retrieval plus post-generation reflection/revision -- augmented by
domain-dependent choices in expansion, reranking, augmentation, and prompt
reordering; passage compression is never selected. Improvement magnitude
correlates with question type, with larger gains on factual/long-answer mixes
than interpretation-heavy sets. These results provide practical, domain-aware
guidance for assembling effective RAG systems and demonstrate the utility of
evolutionary search for full-pipeline optimization.
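As a hedged illustration of the search loop, here is a minimal genetic algorithm over a toy categorical pipeline space. The families and options are placeholders (not RAGSmith's nine families or its 46,080 configurations), and evaluate stands in for running a candidate pipeline and collapsing its retrieval and generation metrics into one scalar.

```python
import random

# Placeholder technique families; RAGSmith's nine families and their
# options differ, and feasibility constraints are omitted here.
SPACE = {
    "retriever": ["bm25", "dense", "hybrid"],
    "expansion": ["none", "query_rewrite"],
    "reranker": ["none", "cross_encoder"],
    "prompting": ["plain", "reordered"],
    "reflection": ["off", "on"],
}

def random_config(rng):
    return {k: rng.choice(v) for k, v in SPACE.items()}

def mutate(cfg, rng, rate=0.2):
    return {k: (rng.choice(v) if rng.random() < rate else cfg[k])
            for k, v in SPACE.items()}

def crossover(a, b, rng):
    return {k: (a if rng.random() < 0.5 else b)[k] for k in SPACE}

def genetic_search(evaluate, generations=10, pop_size=10, elite=2, seed=0):
    """Evolve configs against a scalar objective; `evaluate(config)` is
    assumed to run the pipeline and aggregate retrieval metrics
    (recall@k, mAP, nDCG, MRR) and generation metrics into one float.
    (A real implementation would cache scores instead of re-evaluating.)"""
    rng = random.Random(seed)
    pop = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=evaluate, reverse=True)
        parents = ranked[: pop_size // 2]
        pop = ranked[:elite] + [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
    return max(pop, key=evaluate)

# Demo with a made-up objective that rewards hybrid retrieval + reflection.
demo = lambda c: (c["retriever"] == "hybrid") + (c["reflection"] == "on")
print(genetic_search(demo))
```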
Brown University
Why we think this paper is great for you:
As a flexible toolkit for dense retrieval, this paper provides practical tools and efficient data management features crucial for conducting research experiments in information retrieval. It simplifies the process of exploring advanced retrieval techniques.
Abstract
We introduce Trove, an easy-to-use open-source retrieval toolkit that
simplifies research experiments without sacrificing flexibility or speed. For
the first time, we introduce efficient data management features that load and
process (filter, select, transform, and combine) retrieval datasets on the fly,
with just a few lines of code. This gives users the flexibility to easily
experiment with different dataset configurations without the need to compute
and store multiple copies of large datasets. Trove is highly customizable: in
addition to many built-in options, it allows users to freely modify existing
components or replace them entirely with user-defined objects. It also provides
a low-code and unified pipeline for evaluation and hard negative mining, which
supports multi-node execution without any code changes. Trove's data management
features reduce memory consumption by a factor of 2.6. Moreover, Trove's
easy-to-use inference pipeline incurs no overhead, and inference times decrease
linearly with the number of available nodes. Most importantly, we demonstrate
how Trove simplifies retrieval experiments and allows for arbitrary
customizations, thus facilitating exploratory research.
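The on-the-fly data-management idea generalizes beyond any one toolkit. The sketch below is explicitly not Trove's API; it only illustrates lazy per-record filtering and transformation, which avoids storing modified copies of large datasets.

```python
from typing import Callable, Iterable, Iterator, Optional

# NOT Trove's API: a generic sketch of lazy, on-the-fly dataset
# processing, where filter/transform steps run per record instead of
# materializing a modified copy of a large dataset on disk.
Step = Callable[[dict], Optional[dict]]

def lazy_pipeline(records: Iterable[dict], steps: list[Step]) -> Iterator[dict]:
    for rec in records:
        for step in steps:
            rec = step(rec)
            if rec is None:   # a step returning None drops the record
                break
        else:
            yield rec

# Example: keep long passages and drop an unused field, with no copies stored.
keep_long = lambda r: r if len(r["text"]) > 100 else None
strip_meta = lambda r: {k: v for k, v in r.items() if k != "meta"}

corpus = ({"text": "x" * (60 * i), "meta": i} for i in range(5))
for doc in lazy_pipeline(corpus, [keep_long, strip_meta]):
    print(len(doc["text"]))   # 120, 180, 240
```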
Johns Hopkins University
Why we think this paper is great for you:
This research investigates mobile personalization by simulating user personas, offering a deeper understanding of how personalized experiences are delivered. It directly aligns with your focus on tailoring content and services to individuals.
Abstract
Mobile applications increasingly rely on sensor data to infer user context
and deliver personalized experiences. Yet the mechanisms behind this
personalization remain opaque to users and researchers alike. This paper
presents a sandbox system that uses sensor spoofing and persona simulation to
audit and visualize how mobile apps respond to inferred behaviors. Rather than
treating spoofing as adversarial, we demonstrate its use as a tool for
behavioral transparency and user empowerment. Our system injects multi-sensor
profiles - generated from structured, lifestyle-based personas - into Android
devices in real time, enabling users to observe app responses to contexts such
as high activity, location shifts, or time-of-day changes. With automated
screenshot capture and GPT-4 Vision-based UI summarization, our pipeline helps
document subtle personalization cues. Preliminary findings show measurable app
adaptations across fitness, e-commerce, and everyday service apps such as
weather and navigation. We offer this toolkit as a foundation for
privacy-enhancing technologies and user-facing transparency interventions.
Johns Hopkins University
Why we think this paper is great for you:
This paper explores personalized decision-making models, highlighting the unique processes that shape individual choices. Its insights into utility optimization and textualized reasoning will be valuable for understanding personalization.
Abstract
Decision-making models for individuals, particularly in high-stakes scenarios
like vaccine uptake, often diverge from population optimal predictions. This
gap arises from the uniqueness of the individual decision-making process,
shaped by numerical attributes (e.g., cost, time) and linguistic influences
(e.g., personal preferences and constraints). Building on Utility Theory
and leveraging the textual-reasoning capabilities of Large Language Models
(LLMs), this paper proposes an Adaptive Textual-symbolic Human-centric
Reasoning framework (ATHENA) to address the problem of optimal information
integration.
ATHENA uniquely integrates two stages: first, it discovers robust, group-level
symbolic utility functions via LLM-augmented symbolic discovery; second, it
implements individual-level semantic adaptation, creating personalized semantic
templates guided by the optimal utility to model personalized choices.
Validated on real-world travel mode and vaccine choice tasks, ATHENA
consistently outperforms utility-based, machine learning, and other LLM-based
models, lifting the F1 score by at least 6.5% over the strongest cutting-edge
models. Further, ablation studies confirm that both stages of ATHENA are
critical and complementary, as removing either clearly degrades overall
predictive performance. By organically integrating symbolic utility modeling
and semantic adaptation, ATHENA provides a new scheme for modeling
human-centric decisions. The project page can be found at
https://yibozh.github.io/Athena.
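To ground the utility-theoretic vocabulary, here is a toy multinomial-logit choice model with a per-individual adjustment term standing in for ATHENA's semantic adaptation stage. All attributes, weights, and adjustments are invented for illustration and are not the paper's discovered utility functions.

```python
import math

def group_utility(option, weights):
    """Stage 1 stand-in: a group-level symbolic utility over numeric
    attributes, e.g. U = -(w_cost * cost + w_time * time). The real
    functional form would come from LLM-augmented symbolic discovery."""
    return -(weights["cost"] * option["cost"] + weights["time"] * option["time"])

def choice_probs(options, weights, personal_adjust=None):
    """Multinomial-logit choice over utilities; `personal_adjust` is a
    toy stand-in for Stage 2's individual-level semantic adaptation."""
    utils = {}
    for name, attrs in options.items():
        u = group_utility(attrs, weights)
        if personal_adjust:
            u += personal_adjust.get(name, 0.0)
        utils[name] = u
    z = sum(math.exp(u) for u in utils.values())
    return {name: math.exp(u) / z for name, u in utils.items()}

options = {"bus": {"cost": 2.0, "time": 40.0},
           "car": {"cost": 8.0, "time": 15.0}}
weights = {"cost": 0.3, "time": 0.05}
# An individual whose text says "I avoid driving downtown" might induce
# a negative adjustment on the car option.
print(choice_probs(options, weights, personal_adjust={"car": -0.5}))
```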
City St George's, University of London
Why we think this paper is great for you:
This paper explores the foundational semantics of deep learning, a core technology underpinning many advanced information retrieval and personalization systems. It offers a deeper theoretical perspective on the AI methods you utilize.
Abstract
Artificial Intelligence (AI) is a powerful new language of science as
evidenced by recent Nobel Prizes in chemistry and physics that recognized
contributions to AI applied to those areas. Yet, this new language lacks
semantics, which makes AI's scientific discoveries unsatisfactory at best. If
it is not only to uncover new facts but also to improve our understanding of
the world, AI-based science requires formalization through a framework capable
of translating insight into comprehensible scientific knowledge. In this paper,
we
argue that logic offers an adequate framework. In particular, we use logic in a
neurosymbolic framework to offer a much-needed semantics for deep learning, the
neural network-based technology of current AI. Deep learning and neurosymbolic
AI lack a general set of conditions to ensure that desirable properties are
satisfied. Instead, there is a plethora of encoding and knowledge extraction
approaches designed for particular cases. To rectify this, we introduced a
framework for semantic encoding, making explicit the mapping between neural
networks and logic, and characterizing the common ingredients of the various
existing approaches. In this paper, we describe succinctly and exemplify how
logical semantics and neural networks are linked through this framework, we
review some of the most prominent approaches and techniques developed for
neural encoding and knowledge extraction, provide a formal definition of our
framework, and discuss some of the difficulties of identifying a semantic
encoding in practice in light of analogous problems in the philosophy of mind.
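As a textbook-style illustration of what a mapping between logic and neural networks can look like (this is the classic encoding such frameworks generalize, not the paper's formal definition), a propositional rule can be compiled into a threshold unit whose behavior matches the rule's truth table:

```python
def threshold_unit(inputs, weights, bias):
    """A binary neuron: fires (1) iff the weighted input sum plus bias
    is positive."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

# Encode the rule  C <- A AND B  into weights and a bias: the unit can
# cross the threshold only when both antecedents are true.
def rule_c(a, b):
    return threshold_unit([a, b], weights=[1.0, 1.0], bias=-1.5)

# The encoding is faithful ("semantic") if the network's input-output
# behavior reproduces the rule's truth table.
for a in (0, 1):
    for b in (0, 1):
        assert rule_c(a, b) == int(a and b)
print("unit agrees with the truth table of A AND B")
```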