Abstract
Test-time scaling (TTS) for large language models (LLMs) has thus far fallen
into two largely separate paradigms: (1) reinforcement learning (RL) methods
that optimize sparse outcome-based rewards, yet suffer from instability and low
sample efficiency; and (2) search-based techniques guided by independently
trained, static process reward models (PRMs), which require expensive human- or
LLM-generated labels and often degrade under distribution shifts. In this
paper, we introduce AIRL-S, the first natural unification of RL-based and
search-based TTS. Central to AIRL-S is the insight that the reward function
learned during RL training inherently represents the ideal PRM for guiding
downstream search. Specifically, we leverage adversarial inverse reinforcement
learning (AIRL) combined with group relative policy optimization (GRPO) to
learn a dense, dynamic PRM directly from correct reasoning traces, entirely
eliminating the need for labeled intermediate process data. The resulting PRM
serves simultaneously as the critic during RL rollouts and, at inference, as a
heuristic that effectively guides search procedures, facilitating robust
reasoning-chain extension, mitigating reward hacking, and enhancing cross-task
generalization. Experimental results across eight benchmarks, including
mathematics, scientific reasoning, and code generation, demonstrate that our
unified approach improves performance by 9% on average over the base model,
matching GPT-4o. Furthermore, when integrated into multiple search algorithms,
our PRM consistently outperforms all baseline PRMs trained with labeled data.
These results underscore that, indeed, your reward function for RL is your best
PRM for search, providing a robust and cost-effective solution to complex
reasoning tasks in LLMs.
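
To make the dual role of the learned reward concrete, the sketch below shows one way an AIRL-style step reward could act as a PRM guiding a step-level beam search at inference. The interfaces `generate_candidates` and `prm_score` are hypothetical placeholders under assumed signatures, not the paper's actual implementation.

```python
# Hedged sketch: an AIRL-style learned reward used as a step-level PRM to
# guide beam search over reasoning chains. The callables are assumed
# interfaces, not the authors' released code.

from typing import Callable, List, Tuple


def prm_guided_beam_search(
    prompt: str,
    generate_candidates: Callable[[str], List[str]],  # proposes next reasoning steps
    prm_score: Callable[[str, str], float],           # learned reward: score(prefix, step)
    beam_width: int = 4,
    max_steps: int = 8,
) -> str:
    """Extend reasoning chains step by step, keeping the beam_width prefixes
    with the highest cumulative PRM score."""
    beams: List[Tuple[str, float]] = [(prompt, 0.0)]
    for _ in range(max_steps):
        expanded: List[Tuple[str, float]] = []
        for prefix, score in beams:
            for step in generate_candidates(prefix):
                # A dense per-step reward from the AIRL discriminator stands in
                # for a process reward model -- no labeled step data required.
                expanded.append((prefix + "\n" + step, score + prm_score(prefix, step)))
        if not expanded:
            break
        expanded.sort(key=lambda x: x[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][0]
```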
Abstract
Offline reinforcement learning refers to the process of learning policies
from fixed datasets, without requiring additional environment interaction.
However, it often relies on well-defined reward functions, which are difficult
and expensive to design. Human feedback is an appealing alternative, but its
two common forms, expert demonstrations and preferences, have complementary
limitations. Demonstrations provide stepwise supervision, but they are costly
to collect and often reflect limited expert behavior modes. In contrast,
preferences are easier to collect, but it is unclear which parts of a trajectory
segment contribute most to its preference label, leaving credit assignment unresolved.
In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to
unify these two feedback sources. For each transition in a preference-labeled
trajectory, SPW searches for the most similar state-action pairs from expert
demonstrations and directly derives stepwise importance weights based on their
similarity scores. These weights are then used to guide standard preference
learning, enabling more accurate credit assignment than traditional approaches
can achieve. We demonstrate that SPW enables effective joint learning
from preferences and demonstrations, outperforming prior methods that leverage
both feedback types on challenging robot manipulation tasks.
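
A minimal sketch of the weighting idea follows, assuming Euclidean similarity over concatenated state-action features and a Bradley-Terry style preference loss; the exact distance metric, temperature, normalization, and loss aggregation used by SPW may differ from this illustration.

```python
# Hedged sketch of search-based preference weighting (SPW): for each transition
# in a preference-labeled segment, find the most similar expert (state, action)
# pair and turn the similarity into a stepwise importance weight inside a
# Bradley-Terry preference loss. Shapes and hyperparameters are assumptions.

import numpy as np


def stepwise_weights(segment_sa: np.ndarray, expert_sa: np.ndarray, temp: float = 1.0) -> np.ndarray:
    """segment_sa: (T, d) concatenated state-action features of one segment.
    expert_sa: (N, d) expert demonstration state-action pairs.
    Returns a (T,) importance weight per transition based on nearest-expert similarity."""
    # Squared Euclidean distance from every segment step to every expert pair.
    dists = ((segment_sa[:, None, :] - expert_sa[None, :, :]) ** 2).sum(-1)
    nearest = dists.min(axis=1)              # distance to the closest expert pair
    sims = np.exp(-nearest / temp)           # similarity score in (0, 1]
    return sims / (sims.sum() + 1e-8)        # normalize into stepwise weights


def weighted_preference_loss(r_pref: np.ndarray, r_other: np.ndarray,
                             w_pref: np.ndarray, w_other: np.ndarray) -> float:
    """Bradley-Terry loss where per-step rewards are aggregated with SPW weights
    instead of a plain sum, sharpening credit assignment toward expert-like steps."""
    score_pref = float((w_pref * r_pref).sum())
    score_other = float((w_other * r_other).sum())
    return float(-np.log(np.exp(score_pref) / (np.exp(score_pref) + np.exp(score_other))))
```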