Abstract
Retrieval-Augmented Generation (RAG) has become a cornerstone technique for
enhancing large language models (LLMs) with external knowledge. However,
current RAG systems face two critical limitations: (1) they inefficiently
retrieve information for every query, including simple questions that could be
resolved using the LLM's parametric knowledge alone, and (2) they risk
retrieving irrelevant documents when queries contain sparse information
signals. To address these gaps, we introduce Parametric-verified Adaptive
Information Retrieval and Selection (PAIRS), a training-free framework that
integrates parametric and retrieved knowledge to adaptively determine whether
to retrieve and how to select external information. Specifically, PAIRS employs
a dual-path generation mechanism: the LLM first produces both a direct answer
and a context-augmented answer using a self-generated pseudo-context. When these
outputs converge, PAIRS bypasses external retrieval entirely, dramatically
improving the RAG system's efficiency. For divergent cases, PAIRS activates a
dual-path retrieval (DPR) process guided by both the original query and
self-generated contextual signals, followed by an Adaptive Information
Selection (AIS) module that filters documents through weighted similarity to
both sources. This simple yet effective approach can not only enhance
efficiency by eliminating unnecessary retrievals but also improve accuracy
through contextually guided retrieval and adaptive information selection.
Experimental results on six question-answering (QA) benchmarks show that PAIRS
reduces retrieval costs by around 25% (triggering retrieval for only 75% of
queries) while still improving accuracy, achieving average gains of +1.1% EM
and +1.0% F1 over prior baselines.
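A minimal sketch of the decision flow described in the abstract is given below. The helper objects (llm, retriever, embed), the convergence check, and the weighting and threshold values are illustrative placeholders, not the authors' implementation; the weighted-similarity selection stands in for the Adaptive Information Selection (AIS) module.

```python
# Sketch of the PAIRS flow under stated assumptions: llm.generate, retriever.search,
# and embed are hypothetical interfaces; alpha and top_k are illustrative values.
import numpy as np


def answers_converge(a: str, b: str) -> bool:
    # Placeholder convergence check; the paper's exact criterion may differ.
    return a.strip().lower() == b.strip().lower()


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))


def pairs_answer(query: str, llm, retriever, embed, alpha: float = 0.5, top_k: int = 5) -> str:
    # 1) Dual-path generation: a direct answer and an answer conditioned on
    #    a self-generated pseudo-context.
    direct = llm.generate(query)
    pseudo_context = llm.generate(f"Write a short background passage about: {query}")
    contextual = llm.generate(f"Context: {pseudo_context}\nQuestion: {query}")

    # 2) If the two answers agree, skip external retrieval entirely.
    if answers_converge(direct, contextual):
        return direct

    # 3) Dual-path retrieval guided by the query and the pseudo-context.
    docs = retriever.search(query) + retriever.search(pseudo_context)

    # 4) Adaptive Information Selection: score documents by weighted
    #    similarity to both the query and the pseudo-context.
    q_vec, c_vec = embed(query), embed(pseudo_context)
    scored = sorted(
        docs,
        key=lambda d: alpha * cosine(embed(d), q_vec) + (1 - alpha) * cosine(embed(d), c_vec),
        reverse=True,
    )
    selected = scored[:top_k]

    # 5) Final answer conditioned on the selected evidence.
    evidence = "\n".join(selected)
    return llm.generate(f"Context: {evidence}\nQuestion: {query}")
```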
Abstract
Evaluating the quality of retrieval-augmented generation (RAG) and document
reranking systems remains challenging due to the lack of scalable,
user-centric, and multi-perspective evaluation tools. We introduce RankArena, a
unified platform for comparing and analysing the performance of retrieval
pipelines, rerankers, and RAG systems using structured human and LLM-based
feedback, and for collecting such feedback. RankArena supports multiple
evaluation modes: direct reranking visualisation, blind pairwise comparisons
with human or LLM voting, supervised manual document annotation, and end-to-end
RAG answer quality assessment. It captures fine-grained relevance feedback
through both pairwise preferences and full-list annotations, along with
auxiliary metadata such as movement metrics, annotation time, and quality
ratings. The platform also integrates LLM-as-a-judge evaluation, enabling
comparison between model-generated rankings and human ground truth annotations.
All interactions are stored as structured evaluation datasets that can be used
to train rerankers, reward models, judgment agents, or retrieval strategy
selectors. Our platform is publicly available at https://rankarena.ngrok.io/,
and a demo video is available at https://youtu.be/jIYAP4PaSSI.
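To make the "structured evaluation datasets" concrete, the sketch below shows one plausible shape for a stored interaction record covering the feedback types the abstract mentions (pairwise preferences, full-list annotations, movement metrics, annotation time, quality ratings, human vs. LLM judges). All field names are illustrative guesses, not the platform's actual schema.

```python
# Hypothetical record layout for RankArena-style feedback; field names are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvaluationRecord:
    query: str
    mode: str                                   # e.g. "pairwise", "full_list", "rag_answer"
    system_a: str                               # first reranker / RAG system under comparison
    system_b: Optional[str]                     # second system for blind pairwise comparisons
    winner: Optional[str]                       # "a", "b", or "tie" from human or LLM voting
    ranked_doc_ids: List[str] = field(default_factory=list)  # full-list relevance annotation
    movement_metric: Optional[float] = None     # how far documents moved vs. the initial order
    annotation_time_s: Optional[float] = None   # auxiliary metadata: time spent annotating
    quality_rating: Optional[int] = None        # e.g. 1-5 rating of end-to-end answer quality
    judge: str = "human"                        # "human" or "llm" (LLM-as-a-judge)
```

Records of this kind could then be aggregated into training data for rerankers, reward models, judgment agents, or retrieval strategy selectors, as the abstract describes.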