Abstract
Test-time scaling (TTS) for large language models (LLMs) has thus far fallen
into two largely separate paradigms: (1) reinforcement learning (RL) methods
that optimize sparse outcome-based rewards, yet suffer from instability and low
sample efficiency; and (2) search-based techniques guided by independently
trained, static process reward models (PRMs), which require expensive human- or
LLM-generated labels and often degrade under distribution shifts. In this
paper, we introduce AIRL-S, the first natural unification of RL-based and
search-based TTS. Central to AIRL-S is the insight that the reward function
learned during RL training inherently represents the ideal PRM for guiding
downstream search. Specifically, we leverage adversarial inverse reinforcement
learning (AIRL) combined with group relative policy optimization (GRPO) to
learn a dense, dynamic PRM directly from correct reasoning traces, entirely
eliminating the need for labeled intermediate process data. At inference, the
resulting PRM simultaneously serves as the critic for RL rollouts and as a
heuristic to effectively guide search procedures, facilitating robust reasoning
chain extension, mitigating reward hacking, and enhancing cross-task
generalization. Experimental results across eight benchmarks, including
mathematics, scientific reasoning, and code generation, demonstrate that our
unified approach improves performance by 9% on average over the base model,
matching GPT-4o. Furthermore, when integrated into multiple search algorithms,
our PRM consistently outperforms all baseline PRMs trained with labeled data.
These results underscore that, indeed, your reward function for RL is your best
PRM for search, providing a robust and cost-effective solution to complex
reasoning tasks in LLMs.
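To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of how a reward model learned during RL training could double as a PRM guiding a beam search over reasoning steps. Here `expand_fn` and `prm_score_fn` are hypothetical stand-ins for an LLM step proposer and the AIRL-learned reward; the toy functions in the usage block exist only to show the control flow.

```python
# Sketch: PRM-guided beam search over reasoning chains.
# Assumption: the learned reward assigns a scalar score to any partial chain.
from typing import Callable, List, Tuple


def prm_guided_beam_search(
    prompt: str,
    expand_fn: Callable[[str], List[str]],    # proposes candidate next steps for a chain
    prm_score_fn: Callable[[str], float],     # scores a partial reasoning chain (the "PRM")
    beam_width: int = 4,
    max_steps: int = 8,
) -> str:
    """Keep the top-`beam_width` partial chains ranked by the PRM at each step."""
    beams: List[Tuple[float, str]] = [(0.0, prompt)]
    for _ in range(max_steps):
        candidates: List[Tuple[float, str]] = []
        for _, chain in beams:
            for step in expand_fn(chain):
                extended = chain + "\n" + step
                candidates.append((prm_score_fn(extended), extended))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    # Return the highest-scoring chain found.
    return max(beams, key=lambda c: c[0])[1]


if __name__ == "__main__":
    # Dummy stand-ins: the expander always proposes the same two steps, and the
    # "PRM" simply prefers chains containing the word "correct".
    expand = lambda chain: ["Step: a correct deduction", "Step: an irrelevant detour"]
    score = lambda chain: float(chain.count("correct"))
    print(prm_guided_beam_search("Q: 2+2=?", expand, score, beam_width=2, max_steps=3))
```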
Abstract
Query optimization is a crucial problem in database systems that has been
studied for decades. Learned query optimizers (LQOs) can improve performance
over time by incorporating feedback; however, they suffer from cold-start
issues and often require retraining when workloads shift or schemas change.
Recent LLM-based query optimizers leverage pre-trained and fine-tuned LLMs to
mitigate these challenges. Nevertheless, they neglect LLMs' in-context learning
capabilities and do not exploit execution records as feedback for continuous
evolution. In this paper, we
present SEFRQO, a Self-Evolving Fine-tuned RAG-based Query Optimizer. SEFRQO
mitigates the cold-start problem of LQOs by continuously learning from
execution feedback via a Retrieval-Augmented Generation (RAG) framework. We
employ both supervised fine-tuning and reinforcement fine-tuning to prepare the
LLM to produce syntactically correct and performance-efficient query hints.
Moreover, SEFRQO leverages the LLM's in-context learning capabilities by
dynamically constructing prompts with references to similar queries and the
historical execution records of the same query. This self-evolving paradigm
iteratively optimizes the prompt to minimize query execution latency.
Evaluations show that SEFRQO outperforms state-of-the-art LQOs, achieving up to
65.05% and 93.57% reductions in query latency on the CEB and Stack workloads,
respectively, compared to PostgreSQL.
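As an illustration only, the sketch below shows one way a self-evolving, RAG-style prompt builder for query hints could be structured. The retrieval (naive token overlap), the in-memory feedback store, and the hint format are assumptions for the sketch, not SEFRQO's actual implementation.

```python
# Sketch: retrieval-augmented prompt construction with execution feedback.
# Assumption: hints and latencies from past runs are logged in memory and
# retrieved by a simple token-overlap similarity (a stand-in for embeddings).
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExecutionRecord:
    sql: str
    hint: str
    latency_ms: float


@dataclass
class FeedbackStore:
    records: List[ExecutionRecord] = field(default_factory=list)

    def add(self, rec: ExecutionRecord) -> None:
        self.records.append(rec)

    def similar(self, sql: str, k: int = 3) -> List[ExecutionRecord]:
        """Rank logged queries by token overlap with `sql`."""
        tokens = set(sql.lower().split())
        ranked = sorted(
            self.records,
            key=lambda r: len(tokens & set(r.sql.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(sql: str, store: FeedbackStore) -> str:
    """Compose a prompt with similar queries and past runs of the same query."""
    refs = "\n".join(
        f"-- similar query (hint: {r.hint}, latency: {r.latency_ms:.0f} ms):\n{r.sql}"
        for r in store.similar(sql)
    )
    history = "\n".join(
        f"-- previous attempt on this query: hint={r.hint}, latency={r.latency_ms:.0f} ms"
        for r in store.records if r.sql == sql
    )
    return (
        "Suggest a PostgreSQL planner hint for the target query below.\n"
        f"{refs}\n{history}\n-- target query:\n{sql}\nHint:"
    )


if __name__ == "__main__":
    store = FeedbackStore()
    store.add(ExecutionRecord(
        "SELECT * FROM users u JOIN orders o ON u.id = o.user_id",
        "HashJoin(u o)", 120.0,
    ))
    print(build_prompt("SELECT * FROM users u JOIN orders o WHERE o.total > 10", store))
```

After each execution, the observed latency would be appended to the store so that later prompts reference progressively better-performing hints.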