Hi!

Your personalized paper recommendations for 26 to 30 January 2026.
Alibaba Group
AI Insights
  • Regression: A type of prediction problem where the goal is to predict a continuous value. (ML: 0.98)
  • The ablation studies reveal that ranking-centric training alone achieves robust ordinal performance, challenging conventional reliance on auxiliary regression supervision. (ML: 0.96)
  • Two-stage training strategy: A training approach where the model is trained in two stages, with different objectives and rewards in each stage. (ML: 0.96)
  • Ordinal ranking: A type of classification problem where the goal is to predict a rank or order for each sample. (ML: 0.96)
  • RARL synergistically enhances regression accuracy and ranking performance through bidirectional regularization. (ML: 0.95)
  • The work demonstrates the effectiveness of RARL in achieving state-of-the-art results on ordinal ranking tasks. (ML: 0.95)
  • Ranking-aware verifiable rewards: Rewards that are designed to encourage the model to produce accurate rankings. (ML: 0.95)
  • Extensive experiments demonstrate state-of-the-art results across three benchmarks, with ablation studies revealing that ranking-centric training alone achieves robust ordinal performance. (ML: 0.92)
  • The work introduces RARL, an efficient and scalable framework for ordinal ranking. (ML: 0.91)
  • The work relies heavily on the Qwen2.5-VL model, which may not be applicable to other models or tasks. (ML: 0.71)
Abstract
Ordinal regression and ranking are challenging due to inherent ordinal dependencies that conventional methods struggle to model. We propose Ranking-Aware Reinforcement Learning (RARL), a novel RL framework that explicitly learns these relationships. At its core, RARL features a unified objective that synergistically integrates regression and Learning-to-Rank (L2R), enabling mutual improvement between the two tasks. This is driven by a ranking-aware verifiable reward that jointly assesses regression precision and ranking accuracy, facilitating direct model updates via policy optimization. To further enhance training, we introduce Response Mutation Operations (RMO), which inject controlled noise to improve exploration and prevent stagnation at saddle points. The effectiveness of RARL is validated through extensive experiments on three distinct benchmarks.
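As a rough illustration of a reward that jointly assesses regression precision and ranking accuracy, the sketch below blends a per-item tolerance check with pairwise ordering concordance. The function name, the `alpha` weight, and the `tol` tolerance are illustrative assumptions, not the paper's actual formulation:

```python
def ranking_aware_reward(preds, targets, alpha=0.5, tol=0.5):
    """Illustrative blend of regression precision and ranking accuracy.

    alpha weighs the regression term against the ranking term;
    both names and defaults are assumptions for this sketch.
    """
    n = len(preds)
    # Regression term: fraction of predictions within `tol` of the target.
    reg = sum(abs(p - t) <= tol for p, t in zip(preds, targets)) / n
    # Ranking term: fraction of item pairs ordered consistently with the
    # targets (a Kendall-tau-style concordance score over distinct pairs).
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if targets[i] != targets[j]]
    if not pairs:
        return reg
    concordant = sum(
        (preds[i] - preds[j]) * (targets[i] - targets[j]) > 0
        for i, j in pairs
    )
    return alpha * reg + (1 - alpha) * concordant / len(pairs)
```

Note how the two terms can disagree: predictions that are uniformly offset from the targets score zero on regression precision yet keep a perfect ranking term, which is the kind of tension a joint objective is meant to balance.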
Why are we recommending this paper?
Due to your Interest in Travel Ranking

This paper directly addresses ranking, one of your core interests, using reinforcement learning to optimize ranking models. Its focus on ordinal dependencies aligns well with the needs of sophisticated travel recommendation and search features.
Renmin University of China
AI Insights
  • The proposed system, called FAP-CD, uses conditional diffusion generation to create age-friendly community plans that are fair and equitable. (ML: 0.96)
  • LLMs: Large Language Models. Reinforcement Learning: A type of machine learning where an agent learns to take actions in an environment to maximize a reward. (ML: 0.96)
  • FAP-CD's ability to create fair and equitable community plans makes it a valuable tool for policymakers and urban planners. (ML: 0.95)
  • Future research should focus on scaling up the system to larger cities and exploring its application in other domains, such as environmental sustainability. (ML: 0.93)
  • The authors combine the strengths of LLMs in generating human-like text with the ability of reinforcement learning to optimize decision-making processes. (ML: 0.93)
  • The paper proposes a novel approach to urban planning using large language models (LLMs) and reinforcement learning. (ML: 0.93)
  • Conditional Diffusion Generation: A technique used in generative models to generate new data samples that are similar to existing ones, but with certain conditions or constraints applied. (ML: 0.90)
  • The proposed FAP-CD system demonstrates the potential of using LLMs and reinforcement learning for urban planning. (ML: 0.90)
  • FAP-CD is evaluated using a real-world dataset from the city of Beijing, China, and demonstrates significant improvements over traditional urban planning methods. (ML: 0.89)
Abstract
Effective urban planning is crucial for enhancing residents' quality of life and ensuring societal stability, playing a pivotal role in the sustainable development of cities. Current planning methods heavily rely on human experts, which are time-consuming and labor-intensive, or utilize deep learning algorithms, often limiting stakeholder involvement. To bridge these gaps, we propose Intelli-Planner, a novel framework integrating Deep Reinforcement Learning (DRL) with large language models (LLMs) to facilitate participatory and customized planning scheme generation. Intelli-Planner utilizes demographic, geographic data, and planning preferences to determine high-level planning requirements and demands for each functional type. During training, a knowledge enhancement module is employed to enhance the decision-making capability of the policy network. Additionally, we establish a multi-dimensional evaluation system and employ LLM-based stakeholders for satisfaction scoring. Experimental validation across diverse urban settings shows that Intelli-Planner surpasses traditional baselines and achieves comparable performance to state-of-the-art DRL-based methods in objective metrics, while enhancing stakeholder satisfaction and convergence speed. These findings underscore the effectiveness and superiority of our framework, highlighting the potential for integrating the latest advancements in LLMs with DRL approaches to revolutionize tasks related to functional areas planning.
Why are we recommending this paper?
Due to your Interest in Travel Planning

Given your interest in travel planning and personalization, this paper's exploration of customized planning with large language models is highly relevant. Its use of reinforcement learning suggests a system capable of generating tailored itineraries.
University of Liverpool
AI Insights
  • Instance-dependent complexity: The complexity of an algorithm depends on the specific instance it is applied to, rather than just its worst-case performance. (ML: 0.96)
  • Future directions include closing the constant-factor gap between the upper and lower bounds on instance-dependent complexity. (ML: 0.90)
  • The paper assumes the existence of weak and strong oracles, which may not be feasible in all scenarios. (ML: 0.89)
  • The algorithms ACE and ACE-W adaptively focus strong evaluations on critical items, yielding instance-dependent complexity governed by the near-tie mass. (ML: 0.88)
  • PAC guarantees: Probably Approximately Correct. Two-oracle framework: A framework that uses two types of oracles (weak and strong) to certify the exact top-k set. (ML: 0.81)
  • The paper introduces a two-oracle framework for certifying the exact top-k set under PAC guarantees. (ML: 0.81)
  • The algorithms ACE and ACE-W are shown to be effective in reducing strong oracle usage, making them suitable for applications where computational resources are limited. (ML: 0.80)
  • The paper provides a new approach to certifying the exact top-k set under PAC guarantees, with significant improvements over existing methods. (ML: 0.78)
  • The experiments show that ACE and ACE-W achieve significant reductions in strong oracle usage, with speedups of 2.4x and 2.8x over TA and STC, respectively. (ML: 0.67)
Abstract
Identifying the top-$k$ items is fundamental but often prohibitive when exact valuations are expensive. We study a two-oracle setting with a fast, noisy weak oracle and a scarce, high-fidelity strong oracle (e.g., human expert verification or expensive simulation). We first analyze a simple screen-then-certify baseline (STC) and prove it makes at most $m(4\varepsilon_{\max})$ strong calls given jointly valid weak confidence intervals with maximum radius $\varepsilon_{\max}$, where $m(\cdot)$ denotes the near-tie mass around the top-$k$ threshold. We establish a conditional lower bound of $\Omega(m(\varepsilon_{\max}))$ for any algorithm given the same weak uncertainty. Our main contribution is ACE, an adaptive certification algorithm that focuses strong queries on critical boundary items, achieving the same $O(m(4\varepsilon_{\max}))$ bound while reducing strong calls in practice. We then introduce ACE-W, a fully adaptive two-phase method that allocates weak budget adaptively before running ACE, further reducing strong costs.
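A minimal sketch of the screen-then-certify idea, assuming jointly valid weak confidence intervals: items the intervals clearly rule in or out are settled for free, and only items straddling the top-k boundary trigger strong-oracle calls. This simplifies STC and omits ACE's adaptive refinements entirely:

```python
def certify_top_k(intervals, strong_oracle, k):
    """intervals: list of (lo, hi) weak-oracle confidence intervals.
    strong_oracle(i) returns the exact value of item i.
    Returns the sorted indices of the certified top-k set."""
    n = len(intervals)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    in_thresh = highs[k] if n > k else float("-inf")  # (k+1)-th largest hi
    out_thresh = lows[k - 1]                          # k-th largest lo
    # Certainly in: lo beats the (k+1)-th largest upper bound, so at most
    # k-1 other items can possibly exceed this one.
    certain = [i for i, (lo, _) in enumerate(intervals) if lo > in_thresh]
    # Boundary: neither certainly in nor certainly out; only these
    # near-tie items consume expensive strong-oracle calls.
    boundary = [i for i, (lo, hi) in enumerate(intervals)
                if lo <= in_thresh and hi >= out_thresh]
    exact = {i: strong_oracle(i) for i in boundary}
    need = k - len(certain)
    resolved = sorted(exact, key=exact.get, reverse=True)[:need]
    return sorted(certain + resolved)
```

The strong-call count is then the size of the boundary set, which mirrors the near-tie mass $m(\cdot)$ in the bounds above: wide intervals or many near-ties enlarge the boundary, tight intervals shrink it.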
Why are we recommending this paper?
Due to your Interest in Travel Ranking

The paper's focus on adaptive ranking and on using oracles to improve ranking accuracy relates directly to your interest in travel ranking and recommendation systems. This approach is crucial for building effective travel search and filtering tools.
Georgia Institute of Technology
Paper visualization
AI Insights
  • Statements that limit the predictive power of an attribute are valuable, and evidence varies in diagnostic strength, with some cues being weakly diagnostic while others are highly diagnostic. (ML: 0.99)
  • The guidelines for human reasoning chains in GeoGuessr emphasize the importance of following the actual thought process of the expert, with no preferred order. (ML: 0.98)
  • The guidelines also advise against referring to oneself, instead using phrases such as 'The grass looks like X.' The human grading process involves scoring candidate VLM reasoning chains with reference reasoning chains written by the best expert, using a 1-to-All LLM-as-a-judge algorithm and a Key Points guided LLM judging algorithm. (ML: 0.98)
  • The typical statements in a reasoning chain identify an attribute and describe its approximate geographic support, using visual descriptions that are easier to verify and inspire more trust. (ML: 0.98)
  • A human reasoning chain is a sequence of statements that describe the thought process used to arrive at a geolocation guess. (ML: 0.97)
  • It is not necessary to explain every attribute in exhaustive detail, but including a few distinguishing features can be helpful. (ML: 0.96)
  • A VLM (Vision-Language Model) is a type of AI model that can understand and generate text related to images. (ML: 0.96)
  • The results show that the one-to-all scoring approach aligns best with the F1 scores across different flavors of candidates, and the qualitative results demonstrate failure scenarios similar to those shown in Figure 7. (ML: 0.95)
  • Broad descriptors such as 'arid,' 'temperate,' or 'tropical' are useful despite being geographically widespread. (ML: 0.94)
  • LLM (Large Language Model) refers to a type of AI model that can understand and generate human-like language, often used for tasks such as question-answering and text generation. (ML: 0.94)
  • Reasoning chains can be arbitrarily long and should include substantial supporting evidence when present. (ML: 0.93)
Abstract
Vision Language Models (VLMs) are good at recognizing the global location of a photograph -- their geolocation prediction accuracy rivals the best human experts. But many VLMs are startlingly bad at explaining which image evidence led to their prediction, even when their location prediction is correct. The reasoning chains produced by VLMs frequently hallucinate scene attributes to support their location prediction (e.g. phantom writing, imagined infrastructure, misidentified flora). In this paper, we introduce the first benchmark for geolocation reasoning chains. We focus on the global location prediction task in the popular GeoGuessr game which draws from Google Street View spanning more than 100 countries. We collaborate with expert GeoGuessr players, including the reigning world champion, to produce 800 ground truth reasoning chains for 500 query scenes. These expert reasoning chains address hundreds of different discriminative visual attributes such as license plate shape, architecture, and soil properties to name just a few. We evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains against our expert reasoning chains and find that Qwen 3 LLM-as-a-judge correlates best with human scoring. Our benchmark reveals that while large, closed-source VLMs such as Gemini and GPT 5 rival human experts at prediction locations, they still lag behind human experts when it comes to producing auditable reasoning chains. Open weights VLMs such as Llama and Qwen catastrophically fail on our benchmark -- they perform only slightly better than a baseline in which an LLM hallucinates a reasoning chain with oracle knowledge of the photo location but no visual information at all. We believe the gap between human experts and VLMs on this task points to VLM limitations at extracting fine-grained visual attributes from high resolution images.
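As a toy stand-in for the key-points style of judging mentioned in the insights above: the benchmark uses an LLM judge, so the plain substring matching below only illustrates the shape of the score (coverage of expert key points by a candidate chain) and is not the paper's method:

```python
def key_point_score(candidate_chain, key_points):
    """Fraction of expert key points mentioned in the candidate chain.

    Toy proxy: case-insensitive substring matching stands in for the
    semantic matching an LLM judge would actually perform.
    """
    if not key_points:
        return 0.0
    text = candidate_chain.lower()
    covered = sum(1 for kp in key_points if kp.lower() in text)
    return covered / len(key_points)
```

Even this crude proxy shows why hallucinated attributes are penalized only indirectly: a chain full of phantom evidence scores low simply by failing to cover the expert's key points, which is one reason judge quality matters so much in the benchmark.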
Why are we recommending this paper?
Due to your Interest in Travel Search

Given your strong interest in travel itinerary creation, understanding geolocation is key. This research explores Vision Language Models for location prediction, which could support more accurate travel recommendations and route planning.
York University
AI Insights
  • AUPRC: Area Under the Precision-Recall Curve. Lack of spread: A measure of model performance that is not well-defined in this context. (ML: 0.97)
  • Calibration slope: A measure of model calibration. (ML: 0.96)
  • When using the second loss function (L∗∗), which includes the AUPRC as a measure of discrimination, the optimal Mis proportion is much higher than under L∗. (ML: 0.96)
  • CITL: Calibration-In-The-Large, a measure of model calibration. (ML: 0.96)
  • The best performing model under L∗ has an AUPRC value of 0.475 and a lack of spread value of 0.063 when α = 0.1. (ML: 0.95)
  • The results suggest that the choice of loss function and the value of α have a significant impact on the performance of the model. (ML: 0.95)
  • The proposed algorithm for tuning the size of the subpopulation (Mis) is applied to a real-world dataset from the eICU cardiac database. (ML: 0.93)
  • When using the first loss function (L∗), the optimal Mis proportion is 0.29 under all values of α except α = 0.9, where it is 0.20. (ML: 0.90)
  • The results show that the optimal Mis value varies depending on the choice of alpha (α), which controls the emphasis on discrimination and calibration in the loss function. (ML: 0.89)
Abstract
Advances in precision medicine increasingly drive methodological innovation in health research. A key development is the use of personalized prediction models (PPMs), which are fit using a similar subpopulation tailored to a specific index patient, and have been shown to outperform one-size-fits-all models, particularly in terms of model discrimination performance. We propose a generalized loss function that enables tuning of the subpopulation size used to fit a PPM. This loss function allows joint optimization of discrimination and calibration, allowing both the performance measures and their relative weights to be specified by the user. To reduce computational burden, we conducted extensive simulation studies to identify practical bounds for the grid of subpopulation sizes. Based on these results, we recommend using a lower bound of 20\% and an upper bound of 70\% of the entire training dataset. We apply the proposed method to both simulated and real-world datasets and demonstrate that previously observed relationships between subpopulation size and model performance are robust. Furthermore, we show that the choice of performance measures in the loss function influences the optimal subpopulation size selected. These findings support the flexible and computationally efficient implementation of PPMs in precision health research.
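A hedged sketch of the tuning loop described above, assuming a toy one-feature nearest-neighbour model and simplified discrimination/calibration measures (a Mann-Whitney AUC and calibration-in-the-large); the function names, the 20%-70% grid, and the `alpha` weighting are illustrative, not the paper's exact loss definitions:

```python
def auc(scores, labels):
    """Mann-Whitney AUC: probability a positive outscores a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def subpop_predict(x_train, y_train, x0, prop):
    """Mean outcome over the `prop` fraction of training patients
    nearest to the index patient x0 (single scalar feature)."""
    k = max(1, round(prop * len(x_train)))
    nearest = sorted(range(len(x_train)),
                     key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k

def tune_subpop_size(x_tr, y_tr, x_va, y_va,
                     grid=(0.2, 0.3, 0.4, 0.5, 0.6, 0.7), alpha=0.5):
    """Pick the subpopulation proportion minimizing a weighted loss of
    (1 - discrimination) and |calibration-in-the-large|."""
    best, best_loss = None, float("inf")
    for prop in grid:
        preds = [subpop_predict(x_tr, y_tr, x0, prop) for x0 in x_va]
        citl = sum(preds) / len(preds) - sum(y_va) / len(y_va)
        loss = alpha * (1 - auc(preds, y_va)) + (1 - alpha) * abs(citl)
        if loss < best_loss:
            best, best_loss = prop, loss
    return best
```

The grid bounds mirror the paper's recommended 20% lower and 70% upper limits on subpopulation size; in practice `alpha` shifts the optimum toward discrimination-friendly smaller subpopulations or calibration-friendly larger ones.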
Why are we recommending this paper?
Due to your Interest in Travel Personalization

This paper's focus on personalized prediction models aligns with your interest in travel personalization. Using tailored loss functions to fit models to individual preferences is a key step in creating more relevant travel recommendations.

Interests not found

We did not find any papers matching the interests below. Try other terms, and consider whether the content exists on arxiv.org.
  • Travel Itinerary Creation
  • Travel Industry
  • Travel
  • Travel Recommendations
You can edit or add more interests any time.