Mercor
AI Insights - McNemar's exact test: A statistical test used to compare the performance of two related samples. (ML: 0.97)
- Pass@1: The proportion of tasks completed correctly by an agent on the first attempt. (ML: 0.95)
- Significance tests using McNemar's exact test with Benjamini-Hochberg correction show that Kimi-K2-Thinking significantly outperforms Gemini-3-flash-preview (p=5.68e-23) and GPT-5.2 (p=7.29e-10), but not GPT-OSS-120B (p=1.0000). (ML: 0.95)
- The APEX-Agents benchmark highlights the importance of developing AI models that can perform complex tasks across professional domains, with a focus on toolbelt approaches, context window management, and intentional termination. (ML: 0.94)
- Benjamini-Hochberg correction: A method for controlling the false discovery rate in multiple testing. (ML: 0.94)
- The APEX-Agents benchmark is a comprehensive evaluation of AI models' ability to perform complex tasks in various professional domains. (ML: 0.93)
- The tools most frequently used by agents are code execution (256,000), add tool to the toolbelt (200,000), list files in the file system (163,874), read spreadsheet tab (127,000), and search the PDF (86,000). (ML: 0.93)
- The benchmark consists of 227 tasks covering finance, law, and management consulting, each requiring the model to complete a specific objective using a set of provided tools. (ML: 0.89)
- The top-performing models on the APEX-Agents benchmark are Gemini 3 Flash, GPT-5.2, and Kimi K2 Thinking, with Pass@1 scores of 0.555, 0.497, and 0.391 respectively. (ML: 0.88)
- ReAct paradigm: A toolbelt approach where reasoning and acting are interleaved in a single loop. (ML: 0.79)
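The two statistical ingredients named above (McNemar's exact test and the Benjamini-Hochberg correction) can be sketched in a few lines of Python. This is an illustrative implementation, not the paper's evaluation code; the function names and toy counts are made up.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test on the discordant pair counts b and c
    (tasks where exactly one of the two models succeeds). Under the null
    hypothesis the discordant pairs split as Binomial(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery
    rate across multiple pairwise comparisons)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):  # enforce monotonicity from the largest p
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        adjusted[i] = prev
    return adjusted
```

For instance, with 10 discordant pairs all favoring one model, the two-sided exact p-value is 2 * 0.5^10, about 0.00195.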
Abstract
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.
Why are we recommending this paper?
Due to your Interest in Agentic RL
This paper presents a benchmark specifically designed for evaluating agentic behavior in complex, long-horizon tasks, aligning directly with your interest in Agentic RL. Its focus on realistic work scenarios created by investment banking, consulting, and legal professionals offers a valuable case study for your research.
Renmin University of China
AI Insights - Agentic capabilities: Fundamental skills like exploration, tool use, and self-verification. (ML: 0.96)
- Current results have limitations, such as generated videos being limited to simple animations and composed music lacking expressiveness and creativity. (ML: 0.95)
- The agentic capability benchmark provided by LLM-in-Sandbox can be used to evaluate models' ability to leverage computational environments. (ML: 0.94)
- Strong LLMs exhibit emergent capabilities to leverage the sandbox environment for general tasks. (ML: 0.92)
- LLM-in-Sandbox can be used as an agentic capability benchmark, measuring fundamental skills like exploration, tool use, and self-verification. (ML: 0.91)
- The metric Δ = LLM-in-Sandbox − LLM offers a meaningful indicator of a model's ability to leverage computational environments. (ML: 0.90)
- LLM-in-Sandbox has the potential to become the default paradigm for serving LLMs, enabling them to perform general tasks and produce actual outputs rather than text descriptions. (ML: 0.88)
- LLM-in-Sandbox: A paradigm that grants LLMs access to a virtual computer and enables them to leverage this environment for general tasks. (ML: 0.85)
- Sandbox-native model training: Training models to interact with the sandbox environment as a first-class objective. (ML: 0.82)
Abstract
We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
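A minimal sketch of the sandbox idea: run model-emitted Python in a separate process and hand the output back to the model. This assumes nothing about the paper's actual implementation; the function name is hypothetical, and real isolation (network restrictions, resource limits) is omitted.

```python
import os
import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, timeout: float = 10.0) -> str:
    """Execute model-emitted Python in a child process and capture its output.
    A toy stand-in for the paper's virtual-computer sandbox: the script runs
    with the host interpreter and no network or resource isolation."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=timeout)
        # Return stdout on success, stderr (e.g. the traceback) on failure,
        # so the model can observe and react to errors.
        return result.stdout if result.returncode == 0 else result.stderr
    finally:
        os.remove(path)
```

Feeding the returned text back into the model's context is what lets it, per the abstract, execute scripts to satisfy formatting requirements or verify its own answers.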
Why are we recommending this paper?
Due to your Interest in Agentic RL
This work explores the potential of LLMs to exhibit general agentic intelligence through sandbox environments, a fascinating approach to developing more adaptable AI systems. Given your interest in deep learning for reinforcement learning, this paper's methodology is particularly relevant.
Harvard University
AI Insights - Bias-variance tradeoff: The trade-off between systematic error from overly simple model assumptions (bias) and error from sensitivity to the particular training data (variance). (ML: 0.99)
- Pooling data across heterogeneous individuals, simplifying model structure, and using a shorter learning horizon are common strategies for bias-variance control in RL. (ML: 0.96)
- Causal knowledge is an important mechanism for navigating bias-variance tradeoffs in RL, enabling learning and optimizing with fewer interactions with the environment. (ML: 0.95)
- Counterfactual data augmentation: The use of artificial data generated from a known causal DAG to artificially increase the total sample size in online learning. (ML: 0.94)
- Factored MDPs: A type of Markov decision process where the state transition probabilities can be factorized using a causal DAG. (ML: 0.88)
- Causal DAGs can be used to handle distal or sparse rewards, effectively define states or design action-selection strategies, and embed causal knowledge into Bayesian priors, counterfactual data, or invariant dynamics. (ML: 0.87)
- Causal DAG: A directed acyclic graph that encodes conditional causal independencies among sets of variables. (ML: 0.85)
- The assumption that the causal DAG is perfectly specified is strong and often wrong, especially when the environment is changing over sequential deployments. (ML: 0.85)
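The counterfactual data augmentation idea above can be illustrated with a toy sketch, assuming a known two-variable DAG X → Y with Bernoulli mechanisms. The function name, keyword arguments, and probabilities are all hypothetical.

```python
import random

def augment_with_dag(real_data, n_synth,
                     p_cause=0.5, p_effect_given_cause=(0.9, 0.2)):
    """Counterfactual-style data augmentation from a toy known DAG X -> Y:
    sample synthetic (x, y) pairs from the assumed causal mechanism and
    pool them with the observed data to enlarge the effective sample size.
    Variance shrinks, but bias creeps in if the DAG is misspecified
    (cf. the caveat in the last insight above)."""
    synth = []
    for _ in range(n_synth):
        x = 1 if random.random() < p_cause else 0           # root cause X
        p_y = p_effect_given_cause[0] if x == 1 else p_effect_given_cause[1]
        y = 1 if random.random() < p_y else 0               # effect Y | X
        synth.append((x, y))
    return real_data + synth
```

An online learner would then fit its value estimates on the pooled sample rather than on the scarce real interactions alone.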
Abstract
Reinforcement learning (RL) has achieved remarkable success in real-world decision-making across diverse domains, including gaming, robotics, online advertising, public health, and natural language processing. Despite these advances, a substantial gap remains between RL research and its deployment in many practical settings. Two recurring challenges often underlie this gap. First, many settings offer limited opportunity for the agent to interact extensively with the target environment due to practical constraints. Second, many target environments often undergo substantial changes, requiring redesign and redeployment of RL systems (e.g., advancements in science and technology that change the landscape of healthcare delivery). Addressing these challenges and bridging the gap between basic research and application requires theory and methodology that directly inform the design, implementation, and continual improvement of RL systems in real-world settings.
In this paper, we frame the application of RL in practice as a three-component process: (i) online learning and optimization during deployment, (ii) post- or between-deployment offline analyses, and (iii) repeated cycles of deployment and redeployment to continually improve the RL system. We provide a narrative review of recent advances in statistical RL that address these components, including methods for maximizing data utility for between-deployment inference, enhancing sample efficiency for online learning within-deployment, and designing sequences of deployments for continual improvement. We also outline future research directions in statistical RL that are use-inspired -- aiming for impactful application of RL in practice.
Why are we recommending this paper?
Due to your Interest in Reinforcement Learning
This survey paper from Harvard University provides a comprehensive overview of the challenges and future directions in RL deployment, directly addressing the practical application of reinforcement learning techniques. It's a solid foundation for understanding the broader landscape of your interests.
Innopolis University
AI Insights - Reinforcement Learning (RL): A subfield of machine learning where an agent learns to take actions in an environment to maximize a reward. (ML: 0.96)
- A comprehensive review of existing literature on memory in RL is provided, highlighting key findings and open research questions. (ML: 0.93)
- The paper also discusses the challenges of evaluating RL agents' performance, particularly when it comes to memory-based methods. (ML: 0.92)
- Partially Observable Markov Decision Process (POMDP): A mathematical framework for modeling decision-making problems with uncertainty. (ML: 0.91)
- ELMUR is designed to handle long-horizon tasks by storing relevant information in an external memory layer and updating it based on new experiences. (ML: 0.88)
- The paper discusses the importance of memory in reinforcement learning (RL) agents and proposes a new benchmark for evaluating their performance. (ML: 0.88)
- External Memory: A type of memory that is separate from the agent's internal memory, used to store relevant information. (ML: 0.85)
- Update/Rewrite Mechanisms: Methods for updating or rewriting the contents of an external memory layer based on new experiences. (ML: 0.84)
- Memory-Augmented Agents: RL agents that use external memory to store and retrieve information. (ML: 0.84)
- The authors introduce a novel approach called ELMUR (External Layer Memory with Update/Rewrite) that combines the benefits of external memory and update/rewrite mechanisms. (ML: 0.78)
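A toy version of a slot-based external memory with an update/rewrite rule, in the spirit of the mechanisms the insights describe. ELMUR's actual layer is learned end-to-end; the class name and the fixed blending coefficient `alpha` here are purely illustrative.

```python
class ExternalMemory:
    """Minimal slot-based external memory with an update/rewrite rule.
    Each write blends new content into a slot instead of appending,
    so outdated information is gradually overwritten."""

    def __init__(self, n_slots: int, dim: int, alpha: float = 0.5):
        self.slots = [[0.0] * dim for _ in range(n_slots)]
        self.alpha = alpha  # how strongly a new experience overwrites old content

    def write(self, slot: int, vec):
        """Rewrite: convex blend of the old slot content and the new vector."""
        old = self.slots[slot]
        self.slots[slot] = [(1 - self.alpha) * o + self.alpha * v
                            for o, v in zip(old, vec)]

    def read(self, slot: int):
        return list(self.slots[slot])
```

With `alpha` near 1 the memory adapts quickly but forgets aggressively; near 0 it retains stably but updates slowly, which is exactly the retention-vs-rewriting tension the benchmark probes.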
Abstract
Effective decision-making in the real world depends on memory that is both stable and adaptive: environments change over time, and agents must retain relevant information over long horizons while also updating or overwriting outdated content when circumstances shift. Existing Reinforcement Learning (RL) benchmarks and memory-augmented agents focus primarily on retention, leaving the equally critical ability of memory rewriting largely unexplored. To address this gap, we introduce a benchmark that explicitly tests continual memory updating under partial observability, i.e. the natural setting where an agent must rely on memory rather than current observations, and use it to compare recurrent, transformer-based, and structured memory architectures. Our experiments reveal that classic recurrent models, despite their simplicity, demonstrate greater flexibility and robustness in memory rewriting tasks than modern structured memories, which succeed only under narrow conditions, and transformer-based agents, which often fail beyond trivial retention cases. These findings expose a fundamental limitation of current approaches and emphasize the necessity of memory mechanisms that balance stable retention with adaptive updating. Our work highlights this overlooked challenge, introduces benchmarks to evaluate it, and offers insights for designing future RL agents with explicit and trainable forgetting mechanisms. Code: https://quartz-admirer.github.io/Memory-Rewriting/
Why are we recommending this paper?
Due to your Interest in Reinforcement Learning
This paper investigates the crucial role of adaptive memory in reinforcement learning, a key factor for long-horizon decision making. Understanding how agents manage and update their memories is central to your interest in Agentic RL.
NVIDIA
AI Insights - Previous work has shown that RL can be used to train large language models to reason and follow instructions. (ML: 0.94)
- However, most existing RL frameworks are designed for off-policy learning, which can lead to low hardware utilization and inefficient training. (ML: 0.93)
- The paper does not discuss the potential limitations of using FP8 in RL training, such as its impact on model accuracy or the need for specialized hardware to support FP8 operations. (ML: 0.92)
- RL: Reinforcement Learning. On-policy RL: A type of RL where the policy is updated based on the experiences collected by following that policy itself. (ML: 0.91)
- Recent work has proposed various techniques to improve the efficiency of RL training, such as asynchronous RL frameworks and truncated importance sampling. (ML: 0.90)
- The paper proposes a new framework called Jet-RL that enables robust on-policy FP8 RL training by adopting an identical FP8 precision flow for both the training forward pass and the inference rollout stage. (ML: 0.86)
- By delivering up to 1.33x rollout-phase speedup, up to 1.41x training-phase speedup, and a 1.16x end-to-end speedup without sacrificing model accuracy, Jet-RL establishes a reliable and efficient path forward for applying FP8 computation to accelerate large-scale RL training. (ML: 0.82)
- FP8 (E4M3): A floating-point format with 4 bits for the exponent, 3 bits for the mantissa, and a sign bit. (ML: 0.70)
- Jet-RL robustly converges across all models and benchmark settings, maintaining competitive performance close to the BF16 RL baseline, usually with less than 1% degradation. (ML: 0.62)
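The E4M3 format described in the insights can be simulated with a small rounding helper. This is an illustrative sketch (round-to-nearest on the mantissa grid, no NaN/Inf handling), not NVIDIA's kernel code.

```python
import math

def quantize_fp8_e4m3(x: float) -> float:
    """Round x to the nearest representable FP8 E4M3 value
    (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7).
    Clamps to the E4M3 maximum normal value of 448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    # Clamp the exponent to E4M3's normal range; values below 2**-6
    # fall onto the subnormal grid with spacing 2**-9.
    e = max(min(math.floor(math.log2(a)), 8), -6)
    step = 2.0 ** (e - 3)          # grid spacing given 3 mantissa bits
    q = round(a / step) * step
    return sign * min(q, 448.0)
```

Passing both the training forward pass and the rollout through the same quantizer is the unified-precision idea in miniature: both sides see identical rounded values, removing the train/inference numerical mismatch the paper identifies as the source of off-policy instability.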
Abstract
Reinforcement learning (RL) is essential for enhancing the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, with the rollout phase accounting for over 70% of total training time. Quantized RL training, particularly using FP8 precision, offers a promising approach to mitigating this bottleneck. A commonly adopted strategy applies FP8 precision during rollout while retaining BF16 precision for training. In this work, we present the first comprehensive study of FP8 RL training and demonstrate that the widely used BF16-training + FP8-rollout strategy suffers from severe training instability and catastrophic accuracy collapse under long-horizon rollouts and challenging tasks. Our analysis shows that these failures stem from the off-policy nature of the approach, which introduces substantial numerical mismatch between training and inference. Motivated by these observations, we propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization. The key idea is to adopt a unified FP8 precision flow for both training and rollout, thereby minimizing numerical discrepancies and eliminating the need for inefficient inter-step calibration. Extensive experiments validate the effectiveness of Jet-RL: our method achieves up to 33% speedup in the rollout phase, up to 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence across all settings and incurring negligible accuracy degradation.
Why are we recommending this paper?
Due to your Interest in Deep Learning for Reinforcement Learning
This paper tackles the computational efficiency of RL training, particularly focusing on quantization techniques like FP8, which is highly relevant to scaling deep reinforcement learning models. Given your interest in deep learning, this work offers valuable insights into optimizing training processes.
University of Miami
AI Insights - A study by Lin et al. (2021) used deep reinforcement learning to solve the electric vehicle routing problem with time windows. (ML: 0.99)
- Imagine you're a delivery driver, and you have to visit many customers in a day. (ML: 0.95)
- The proposed method may not be suitable for stochastic environments or dynamic customer requests. (ML: 0.94)
- The proposed framework is like a smart planner that helps you find the best route by breaking down the problem into smaller parts and learning from experience. (ML: 0.91)
- This study proposes a curriculum-based deep reinforcement learning (CB-DRL) framework that structurally decomposes the electric vehicle routing problem with time windows into a learnable hierarchy of topology, energy, and time. (ML: 0.86)
- You want to plan your route efficiently so that you can deliver all the packages on time. (ML: 0.81)
- CB-DRL: Curriculum-Based Deep Reinforcement Learning. (ML: 0.81)
- PPO: Proximal Policy Optimization. (ML: 0.81)
- The proposed framework demonstrates superior scalability compared to traditional methods; while the exact solver and heuristic become intractable or exceed time limits for large problem sizes (N ≥ 40), CB-DRL efficiently generates valid solutions up to N = 100. (ML: 0.78)
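The three-phase curriculum described in the insights can be sketched as a simple schedule that switches on constraints as training progresses. The step thresholds and flag names are hypothetical, not the authors' configuration.

```python
def curriculum_phase(step: int, phase_a_steps: int, phase_b_steps: int) -> dict:
    """Three-phase curriculum per the paper's description: distance/fleet
    optimization first (Phase A), then battery management (Phase B), then
    the full EVRPTW with time windows (Phase C)."""
    if step < phase_a_steps:
        return {"phase": "A", "battery": False, "time_windows": False}
    if step < phase_a_steps + phase_b_steps:
        return {"phase": "B", "battery": True, "time_windows": False}
    return {"phase": "C", "battery": True, "time_windows": True}
```

A training loop would query this schedule each episode and configure the environment accordingly, pairing it with the phase-specific PPO hyperparameters the abstract mentions.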
Abstract
The electric vehicle routing problem with time windows (EVRPTW) is a complex optimization problem in sustainable logistics, where routing decisions must minimize total travel distance, fleet size, and battery usage while satisfying strict customer time constraints. Although deep reinforcement learning (DRL) has shown great potential as an alternative to classical heuristics and exact solvers, existing DRL models often struggle to maintain training stability, failing to converge or generalize when constraints are dense. In this study, we propose a curriculum-based deep reinforcement learning (CB-DRL) framework designed to resolve this instability. The framework utilizes a structured three-phase curriculum that gradually increases problem complexity: the agent first learns distance and fleet optimization (Phase A), then battery management (Phase B), and finally the full EVRPTW (Phase C). To ensure stable learning across phases, the framework employs a modified proximal policy optimization algorithm with phase-specific hyperparameters, value and advantage clipping, and adaptive learning-rate scheduling. The policy network is built upon a heterogeneous graph attention encoder enhanced by global-local attention and feature-wise linear modulation. This specialized architecture explicitly captures the distinct properties of depots, customers, and charging stations. Trained exclusively on small instances with N=10 customers, the model demonstrates robust generalization to unseen instances ranging from N=5 to N=100, significantly outperforming standard baselines on medium-scale problems. Experimental results confirm that this curriculum-guided approach achieves high feasibility rates and competitive solution quality on out-of-distribution instances where standard DRL baselines fail, effectively bridging the gap between neural speed and operational reliability.
Why are we recommending this paper?
Due to your Interest in Deep Learning for Reinforcement Learning