University of Massachusetts
Abstract
Policy evaluation is often a prerequisite for deploying safety- and
performance-critical systems. Existing evaluation approaches frequently suffer
from high variance due to limited data and long-horizon tasks, or high bias due
to unequal support or inaccurate environmental models. We posit that these
challenges arise, in part, from the standard reinforcement learning (RL)
paradigm of policy learning without explicit consideration of evaluation. As an
alternative, we propose evaluation-aware reinforcement learning (EvA-RL), in
which a policy is trained to maximize expected return while simultaneously
minimizing expected evaluation error under a given value prediction scheme --
in other words, being "easy" to evaluate. We formalize a framework for EvA-RL
and design an instantiation that enables accurate policy evaluation,
conditioned on a small number of rollouts in an assessment environment that can
be different than the deployment environment. However, our theoretical analysis
and empirical results show that there is often a tradeoff between evaluation
accuracy and policy performance when using a fixed value-prediction scheme
within EvA-RL. To mitigate this tradeoff, we extend our approach to co-learn an
assessment-conditioned state-value predictor alongside the policy. Empirical
results across diverse discrete and continuous action domains demonstrate that
EvA-RL can substantially reduce evaluation error while maintaining competitive
returns. This work lays the foundation for a broad new class of RL methods that
treat reliable evaluation as a first-class principle during training.
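As a rough illustration of the combined objective described above, the sketch below adds a mean-squared evaluation-error penalty to a REINFORCE-style policy-gradient loss. The function name eva_rl_loss, the weighting coefficient lam, and the use of returns-to-go as prediction targets are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def eva_rl_loss(log_probs: torch.Tensor,
                returns: torch.Tensor,
                value_preds: torch.Tensor,
                lam: float = 0.5) -> torch.Tensor:
    """Hypothetical evaluation-aware policy-gradient loss for one rollout.

    log_probs   : (T,) log pi(a_t | s_t)
    returns     : (T,) observed returns-to-go G_t
    value_preds : (T,) estimates from a (possibly co-learned) value predictor
    lam         : weight trading off return maximization vs. evaluation error
    """
    # Standard REINFORCE surrogate: maximize expected return.
    pg_loss = -(log_probs * returns).mean()

    # Evaluation-error penalty: also favor behavior whose value the predictor
    # can estimate accurately, i.e. a policy that is "easy" to evaluate.
    eval_error = (value_preds - returns).pow(2).mean()

    return pg_loss + lam * eval_error
```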
AI Insights - EvA-RL embeds a learned value predictor into the policy-gradient loss, penalizing evaluation error during training.
- A behavioral encoding captures trajectory statistics, enabling accurate performance estimation in a separate assessment environment (one possible encoding is sketched after this list).
- When the predictor's predictability coefficient exceeds 0.8, EvA-RL outperforms vanilla policy gradients in return.
- Predictor training adds computational overhead, so hyper-parameter tuning is critical for the speed-accuracy balance.
- Co-learning the policy with an assessment-conditioned value network mitigates the return-evaluation tradeoff.
- Deep Reinforcement Learning: A Brief Introduction offers foundational concepts that complement EvA-RL's framework.
- Proximal Policy Optimization provides a stable baseline against which to benchmark EvA-RL's performance gains.
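Building on the insights about behavioral encodings and the co-learned, assessment-conditioned value predictor, the sketch below assumes the encoding is a simple mean of per-step features pooled over a few assessment-environment rollouts. The class name, the pooling choice, and the MLP architecture are assumptions for illustration; the paper's actual encoding of trajectory statistics may differ.

```python
import torch
import torch.nn as nn

class AssessmentConditionedValuePredictor(nn.Module):
    """Hypothetical predictor of a policy's state values, conditioned on a
    behavioral encoding computed from a few assessment-environment rollouts."""

    def __init__(self, state_dim: int, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    @staticmethod
    def behavioral_encoding(rollouts: list) -> torch.Tensor:
        # rollouts: list of (T_i, feat_dim) tensors of per-step features
        # (e.g. state, action, reward) gathered in the assessment environment.
        # Mean-pooling is one simple way to summarize trajectory statistics.
        return torch.cat(rollouts, dim=0).mean(dim=0)

    def forward(self, states: torch.Tensor, encoding: torch.Tensor) -> torch.Tensor:
        # states: (B, state_dim); encoding: (feat_dim,), shared across the batch.
        enc = encoding.unsqueeze(0).expand(states.shape[0], -1)
        return self.net(torch.cat([states, enc], dim=-1)).squeeze(-1)
```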
Tencent, The Chinese University of Hong Kong
Abstract
The growing disparity between the exponential scaling of computational
resources and the finite growth of high-quality text data now constrains
conventional scaling approaches for large language models (LLMs). To address
this challenge, we introduce Reinforcement Learning on Pre-Training data
(RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast
to prior approaches that scale training primarily through supervised learning,
RLPT enables the policy to autonomously explore meaningful trajectories to
learn from pre-training data and improve its capability through reinforcement
learning (RL). While existing RL strategies such as reinforcement learning from
human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR)
rely on human annotation for reward construction, RLPT eliminates this
dependency by deriving reward signals directly from pre-training data.
Specifically, it adopts a next-segment reasoning objective, rewarding the
policy for accurately predicting subsequent text segments conditioned on the
preceding context. This formulation allows RL to be scaled on pre-training
data, encouraging the exploration of richer trajectories across broader
contexts and thereby fostering more generalizable reasoning skills. Extensive
experiments on both general-domain and mathematical reasoning benchmarks across
multiple models validate the effectiveness of RLPT. For example, when applied
to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0, 5.1, 8.1,
6.0, 6.6, and 5.3 points on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and
AIME25, respectively. The results further demonstrate favorable scaling
behavior, suggesting strong potential for continued gains with more compute. In
addition, RLPT provides a solid foundation, extending the reasoning boundaries
of LLMs and enhancing RLVR performance.
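To make the next-segment reasoning objective concrete, the sketch below splits a pre-training document into (context, next-segment) pairs and scores a generated continuation by its textual overlap with the true next segment. The segment length, the use of difflib.SequenceMatcher as the scorer, and the helper names are assumptions; the paper's reward may rely on a different verifier.

```python
from difflib import SequenceMatcher

def make_pairs(document: str, segment_len: int = 200):
    """Yield (context, next_segment) pairs from one pre-training document."""
    segments = [document[i:i + segment_len]
                for i in range(0, len(document), segment_len)]
    for i in range(1, len(segments)):
        yield "".join(segments[:i]), segments[i]

def next_segment_reward(generated: str, reference: str) -> float:
    """Score a generated continuation against the segment that actually
    follows the context in the corpus; no human annotation is required."""
    return SequenceMatcher(None, generated, reference).ratio()

# Usage sketch: for each (context, target) pair, sample a continuation from
# the policy, score it with next_segment_reward, and update the policy with
# any standard RL algorithm.
```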
AI Insights - RLPT turns pre-training corpora into a self-rewarding playground, letting the model chase its own next-segment predictions.
- By forgoing human annotators, RLPT slashes reward-engineering costs while keeping the signal fresh and data-rich.
- The next-segment objective turns every token into a mini-quiz, pushing the policy to explore longer, richer trajectories.
- On Qwen3-4B-Base, RLPT lifts benchmark scores by up to 8.1 absolute points (on GPQA-Diamond), showing that self-derived rewards alone can drive large gains.
- RLPT's scaling curve is steep: more compute yields more reasoning power, hinting at larger gains for future models.
- The framework dovetails with RLVR, with RLPT's self-derived rewards providing a foundation that further enhances verifiable-reward fine-tuning.
- RLPT lets LLMs learn from their own pre-training data, sparking curiosity-driven exploration without external guidance.