Abstract: Evaluating the quality of retrieval-augmented generation (RAG) and document
reranking systems remains challenging due to the lack of scalable,
user-centric, and multi-perspective evaluation tools. We introduce RankArena, a
unified platform for comparing and analysing the performance of retrieval
pipelines, rerankers, and RAG systems using structured human and LLM-based
feedback as well as for collecting such feedback. RankArena supports multiple
evaluation modes: direct reranking visualisation, blind pairwise comparisons
with human or LLM voting, supervised manual document annotation, and end-to-end
RAG answer quality assessment. It captures fine-grained relevance feedback
through both pairwise preferences and full-list annotations, along with
auxiliary metadata such as movement metrics, annotation time, and quality
ratings. The platform also integrates LLM-as-a-judge evaluation, enabling
comparison between model-generated rankings and human ground truth annotations.
All interactions are stored as structured evaluation datasets that can be used
to train rerankers, reward models, judgment agents, or retrieval strategy
selectors. Our platform is publicly available at https://rankarena.ngrok.io/,
and the Demo video is provided https://youtu.be/jIYAP4PaSSI.
August 07, 2025
Abstract: Re-ranking is critical in recommender systems for optimizing the order of
recommendation lists, thus improving user satisfaction and platform revenue.
Most existing methods follow a generator-evaluator paradigm, where the
evaluator estimates the overall value of each candidate list. However, they
often ignore the fact that users may exit before consuming the full list,
leading to a mismatch between estimated generation value and actual consumption
value. To bridge this gap, we propose CAVE, a personalized Consumption-Aware
list Value Estimation framework. CAVE formulates the list value as the
expectation over sub-list values, weighted by user-specific exit probabilities
at each position. The exit probability is decomposed into an interest-driven
component and a stochastic component, the latter modeled via a Weibull
distribution to capture random external factors such as fatigue. By jointly
modeling sub-list values and user exit behavior, CAVE yields a more faithful
estimate of actual list consumption value. We further contribute three
large-scale real-world list-wise benchmarks from the Kuaishou platform, varying
in size and user activity patterns. Extensive experiments on these benchmarks,
two Amazon datasets, and online A/B testing on Kuaishou show that CAVE
consistently outperforms strong baselines, highlighting the benefit of
explicitly modeling user exits in re-ranking.
August 04, 2025