Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a
powerful paradigm for enhancing Large Language Models (LLMs), exemplified by
the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable
signals, such as passing unit tests in code generation or matching correct
answers in mathematical reasoning. While effective, this requirement largely
confines RLVR to domains with automatically checkable outcomes. To overcome
this, we extend the RLVR paradigm to open-ended tasks by integrating
rubric-based rewards, where carefully designed rubrics serve as structured,
model-interpretable criteria for automatic scoring of subjective outputs. We
construct, to our knowledge, the largest rubric reward system to date, with
over 10,000 rubrics from humans, LLMs, or a hybrid human-LLM collaboration.
Implementing rubric-based RL poses practical challenges; we tackle them with a
clear framework and present an open-sourced Qwen-30B-A3B model with notable
gains: 1) With only 5K+ samples, our system achieves a +5.2% improvement on
open-ended benchmarks (especially in the humanities), outperforming a 671B
DeepSeek-V3 model by +2.4% while preserving general and reasoning abilities.
2) Our method provides
fine-grained stylistic control, using rubrics as anchors to mitigate the
"AI-like" tone and produce more human-like, expressive responses. We share key
lessons in rubric construction, data selection, and training, and discuss
limitations and future releases.
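
To make the rubric-based reward concrete, the sketch below shows one way a
weighted rubric could be turned into a scalar reward for RL training. It is a
minimal illustration, not the paper's implementation: the (criterion, weight)
rubric format, the judge callable (standing in for an LLM grader), and the
name rubric_reward are all assumptions introduced here.

# Minimal sketch of a rubric-based reward. Assumes each rubric item is a
# (criterion, weight) pair and `judge` is any callable (e.g. an LLM grader)
# that returns True when the response satisfies the criterion.
from typing import Callable, Sequence, Tuple

def rubric_reward(
    response: str,
    rubric: Sequence[Tuple[str, float]],
    judge: Callable[[str, str], bool],
) -> float:
    """Return the weighted fraction of rubric criteria the response satisfies."""
    total = sum(weight for _, weight in rubric)
    if total == 0:
        return 0.0
    earned = sum(weight for criterion, weight in rubric if judge(response, criterion))
    return earned / total

# Toy usage with a keyword-matching stand-in for an LLM judge.
rubric = [("gives a concrete example", 1.0), ("avoids formulaic phrasing", 2.0)]
toy_judge = lambda response, criterion: "example" in response.lower()
print(rubric_reward("For instance, here is a concrete example ...", rubric, toy_judge))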
Abstract
Many applications -- including power systems, robotics, and economics --
involve a dynamical system interacting with a stochastic and hard-to-model
environment. We adopt a reinforcement learning approach to control such
systems. Specifically, we consider a deterministic, discrete-time, linear,
time-invariant dynamical system coupled with a feature-based linear Markov
process with an unknown transition kernel. The objective is to learn a control
policy that optimizes a quadratic cost over the system state, the Markov
process, and the control input. Leveraging both components of the system, we
derive an explicit parametric form for the optimal state-action value function
and the corresponding optimal policy. Our model is distinct in combining
aspects of both the classical Linear Quadratic Regulator (LQR) and the linear
Markov decision process (MDP) frameworks. This combination retains the
implementation simplicity of LQR while allowing for the sophisticated
stochastic modeling afforded by linear MDPs, without estimating the transition
probabilities, thereby enabling direct policy improvement. We use tools from
control theory to provide theoretical guarantees on the stability of the
system under the learned policy and a sample-complexity analysis of its
convergence to the
optimal policy. We illustrate our results via a numerical example that
demonstrates the effectiveness of our approach in learning the optimal control
policy under partially known stochastic dynamics.
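
For concreteness, one plausible instantiation of the model described above is
sketched below; the symbols A, B, Q, R, S, N, the feature map \phi, the
measure \mu, and the discounted objective are assumptions introduced for
illustration rather than the paper's exact formulation.

\begin{aligned}
x_{t+1} &= A x_t + B u_t
  && \text{(deterministic, discrete-time LTI dynamics)}\\
\Pr\!\left(s_{t+1} \in \mathrm{d}s' \mid s_t = s\right) &= \langle \phi(s), \mu(\mathrm{d}s') \rangle
  && \text{(feature-based linear Markov process, kernel unknown)}\\
c(x_t, s_t, u_t) &= x_t^\top Q x_t + u_t^\top R u_t + \phi(s_t)^\top S\, \phi(s_t) + 2\, x_t^\top N \phi(s_t)
  && \text{(quadratic cost coupling state, process, and input)}\\
\min_{\pi}\;\; & \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t\, c(x_t, s_t, u_t)\right]
  && \text{(assumed discounted objective)}
\end{aligned}

Under a model of this form, the optimal state-action value function would
plausibly be quadratic in (x_t, u_t) with additional terms linear in
\phi(s_t), which is consistent with the explicit parametric form claimed in
the abstract.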