Hi!

Your personalized paper recommendations for 24 to 28 November 2025.
🎯 Top Personalized Recommendations
AI Summary
  • Mathematical reasoning in visual contexts is a challenging task for foundation models. [2]
  • Mathematical reasoning: the ability of a model to understand and apply mathematical concepts and operations in visual contexts. [1]
Abstract
MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo -- solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge -- preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction--hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at https://weihao-bo.github.io/ViLoMeo-page.
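To make the dual-stream design described in the abstract more concrete, here is a minimal Python sketch of a schema-based memory with separate visual and logical streams and a grow-and-refine update. The class and function names (Schema, DualStreamMemory, is_similar) and all fields are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a dual-stream, schema-based memory (not the authors' code).
from dataclasses import dataclass, field

@dataclass
class Schema:
    """A compact memory entry distilled from one successful or failed attempt."""
    description: str   # e.g. "ignored axis labels in a bar chart"
    guideline: str     # corrective strategy to apply next time
    hits: int = 1      # how often this schema has been reinforced

@dataclass
class DualStreamMemory:
    visual: list = field(default_factory=list)   # visual-distraction schemas
    logical: list = field(default_factory=list)  # logical-reasoning-error schemas

    def grow_or_refine(self, stream: list, new: Schema, is_similar) -> None:
        """Grow-and-refine: merge with an existing schema if similar, else append."""
        for old in stream:
            if is_similar(old.description, new.description):
                old.hits += 1
                old.guideline = new.guideline  # keep the most recent refinement
                return
        stream.append(new)

    def retrieve(self, k: int = 3):
        """Return the k most reinforced schemas from each stream to prepend to the prompt."""
        top = lambda s: sorted(s, key=lambda x: -x.hits)[:k]
        return top(self.visual), top(self.logical)
```

In this reading, keeping the two streams as separate lists is what lets visual-attention failures and logical errors be retrieved and updated independently.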
Why we think this paper is great for you:
This paper directly explores agentic learning and memory-augmented agents, aligning perfectly with your interests. It delves into how agents can learn more effectively by refining their semantic memory.
AI Summary
  • The research presents a software system for code generation without involving human subjects, sensitive data, or ethically concerning applications. [3]
  • Societal impacts: The potential positive or negative effects of a research paper on society, including issues related to fairness, privacy, security, and ethics. [3]
  • The paper does not release pretrained models, image generators, or scraped datasets. [3]
  • No external code or datasets requiring special licensing are incorporated. [3]
  • Technical and Demo paper: A type of research paper that presents a new system or tool, often with a demo or prototype. [1]
Abstract
LLM-based coding agents are increasingly common but still face challenges in context management, latency, reliability, reproducibility, and scalability. We present Agint, an agentic graph compiler, interpreter, and runtime that incrementally and hierarchically converts natural-language instructions into typed, effect-aware code DAGs. Agint introduces explicit type floors (text to data to spec to code) grounded in semantic graph transformations and a hybrid LLM and function-based JIT runtime. This enables dynamic graph refinement, reproducible and optimizable execution, speculative evaluation, and interoperability with existing developer tools. Agint's typed graph bindings improve reliability and allow concurrent composition of concurrent codebases by construction, supporting accelerated development with smaller and faster models, lower latency, efficient context utilization, and higher throughput. Hierarchical compilation allows scalable graph edits, while the graph structure supports reproducibility and efficient parallel generation. Agint provides a composable unix-style toolchain: dagify (DAG compiler), dagent (hybrid JIT runtime), schemagin (schema generator), and datagin (data transformer) for realtime, low-latency code and dataflow creation. Human developers and coding agents refine graphs through the Agint CLI, while non-technical users use Agint Flow GUI for visual editing, conversational refinement, and debugging to promote prototype agentic workflows to production code. This continuous co-creation model allows teams to prototype quickly, refine seamlessly, and deploy reliably, bridging natural language, compiler methods, and developer tooling to enable a new generation of composable, team-centric coding agents at scale.
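As a rough illustration of the "type floors" and typed, effect-aware DAG described above, the sketch below models nodes that are promoted from text to data to spec to code and then executed in dependency order. All names (Floor, Node, promote, run) and the execution details are assumptions for illustration; Agint's actual tools (dagify, dagent, schemagin, datagin) and APIs are not shown.

```python
# Illustrative sketch of a typed, effect-aware code DAG with "type floors"
# (text -> data -> spec -> code); names are assumptions, not Agint's actual API.
from dataclasses import dataclass, field
from enum import IntEnum

class Floor(IntEnum):
    TEXT = 0   # raw natural-language instruction
    DATA = 1   # structured data extracted from the text
    SPEC = 2   # typed specification of the desired behaviour
    CODE = 3   # executable implementation

@dataclass
class Node:
    name: str
    floor: Floor
    body: str                         # prompt, schema, or source code at this floor
    effects: frozenset = frozenset()  # declared side effects, e.g. {"fs", "net"}
    deps: list = field(default_factory=list)

def promote(node: Node, compiler) -> Node:
    """Lift a node one floor (e.g. SPEC -> CODE) via an LLM- or rule-based compiler."""
    assert node.floor < Floor.CODE, "already at the code floor"
    return Node(node.name, Floor(node.floor + 1), compiler(node), node.effects, node.deps)

def run(dag: dict, jit) -> dict:
    """Execute nodes in dependency order once they reach the CODE floor."""
    results = {}
    for name, node in dag.items():        # assumes dag is given in topological order
        inputs = [results[d] for d in node.deps]
        results[name] = jit(node, inputs)  # a hybrid LLM / function JIT stands in here
    return results
```

Declaring effects per node is one way such a graph could stay reproducible while still allowing speculative, parallel execution of effect-free nodes.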
Why we think this paper is great for you:
You'll find this paper highly relevant as it focuses on agentic systems, specifically an agentic graph compiler for software engineering. It addresses challenges in building robust LLM-based coding agents.
AI Summary
  • DR Tulu-8B is a state-of-the-art deep research system that outperforms proprietary and open deep research systems in various aspects. [3]
  • It achieves strong gains over its base model and competitive performance with closed DR models on GeneticDiseasesQA, a benchmark for researching pathogenic gene variants. [3]
  • RLER improves deep research quality across diverse aspects, including rubric coverage, answer precision, comprehensiveness, depth of response, citation precision, and recall. [3]
  • DR Tulu-8B is also significantly cheaper than other proprietary and open deep research systems, costing approximately USD 0.00008 per query. [3]
  • Deep Research (DR) system: A type of AI model designed to perform complex research tasks by aggregating information from various sources. [3]
  • RLER: Reinforcement Learning with Evolving Rubrics, a training method used to improve the performance of deep research systems. [3]
  • GeneticDiseasesQA: An evaluation dataset consisting of 47 questions derived from expert-curated information about disease-causing genetic variants. [1]
Abstract
Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
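A minimal sketch of how an evolving-rubric reward loop could look, assuming an LLM judge scores each rollout against per-prompt rubric items and a separate routine proposes rubric updates from what the policy explored. The function names (judge_fn, propose_rubrics_fn, update_fn) and the reward definition are hypothetical, not the paper's exact formulation.

```python
# Hypothetical sketch of Reinforcement Learning with Evolving Rubrics (RLER);
# judge_fn, propose_rubrics_fn, and the policy interface are stand-ins, not the paper's code.
def rler_step(policy, prompts, rubrics, judge_fn, propose_rubrics_fn, update_fn):
    """One training iteration: score on-policy rollouts against current rubrics,
    update the policy, then let the rubrics evolve from what the model explored."""
    rollouts = [policy.generate(p) for p in prompts]

    # Reward = fraction of rubric items the judge says the answer satisfies (judge_fn returns 0/1).
    rewards = []
    for prompt, answer in zip(prompts, rollouts):
        items = rubrics[prompt]
        satisfied = sum(judge_fn(prompt, answer, item) for item in items)
        rewards.append(satisfied / max(len(items), 1))

    update_fn(policy, prompts, rollouts, rewards)  # any RLVR-style policy update

    # Co-evolution: fold newly surfaced evidence from the rollouts back into the rubrics,
    # keeping them discriminative for the current policy.
    for prompt, answer in zip(prompts, rollouts):
        rubrics[prompt] = propose_rubrics_fn(prompt, rubrics[prompt], answer)
    return rewards
```

The key difference from fixed-rubric reward models is the final loop: the rubrics are revised on-policy, so feedback stays informative as the model's answers improve.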
Why we think this paper is great for you:
This paper directly combines deep research models with reinforcement learning, offering insights into how RL can be used to train models for complex, long-form research tasks.
AI Summary
  • The paper presents a method to improve the generalizability of reinforcement learning (RL) agents by using a predictor that estimates their performance on unseen environments from the internal weights of their neural networks. [3]
  • This prediction capability is used to modify the Proximal Policy Optimization (PPO) loss function so that training favors agents with higher predicted generalization scores. [3]
  • Experimental results show that the proposed approach yields agents with significantly stronger generalizability than the original PPO algorithm. [3]
  • Reinforcement Learning (RL): A type of machine learning where an agent learns to take actions in an environment to maximize a reward. [3]
  • Generalizability: The ability of an RL agent to perform well on unseen environments or tasks, beyond the ones it was trained on. [3]
  • Predictor: A model that estimates an agent's performance on unseen environments based on its weights and architecture. [3]
Abstract
Generalizability of Reinforcement Learning (RL) agents (ability to perform on environments different from the ones they have been trained on) is a key problem as agents have the tendency to overfit to their training environments. In order to address this problem and offer a solution to increase the generalizability of RL agents, we introduce a new methodology to predict the generalizability score of RL agents based on the internal weights of the agent's neural networks. Using this prediction capability, we propose some changes in the Proximal Policy Optimization (PPO) loss function to boost the generalization score of the agents trained with this upgraded version. Experimental results demonstrate that our improved PPO algorithm yields agents with stronger generalizability compared to the original version.
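A rough PyTorch sketch of the idea: derive a generalizability score from the policy's internal weights and add it as a bonus term to the clipped PPO surrogate. The weight statistics used as features and the coefficient beta are assumptions; the paper's actual predictor and loss modification may differ.

```python
# Sketch only: score generalizability from the agent's weights and fold it into the PPO loss.
import torch
import torch.nn as nn

class GeneralizabilityPredictor(nn.Module):
    """Maps simple statistics of the policy's weights to a predicted generalization score."""
    def __init__(self, n_features: int = 4):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))

    def features(self, policy: nn.Module) -> torch.Tensor:
        # Kept differentiable so the bonus term below can shape the policy's weights.
        w = torch.cat([p.flatten() for p in policy.parameters()])
        return torch.stack([w.mean(), w.std(), w.abs().max(), w.norm() / w.numel()])

    def forward(self, policy: nn.Module) -> torch.Tensor:
        return self.head(self.features(policy)).squeeze()

def ppo_loss_with_gen_bonus(ratio, advantage, policy, predictor, clip=0.2, beta=0.01):
    """Clipped PPO surrogate minus a bonus that rewards weights the predictor deems generalizable."""
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * advantage
    surrogate = -torch.min(ratio * advantage, clipped).mean()
    return surrogate - beta * predictor(policy)
```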
Why we think this paper is great for you:
This paper addresses a critical challenge in reinforcement learning: improving the generalizability of RL agents. You will find its methods for predicting and enhancing agent performance across varied environments very useful.
AI Summary
  • This paper presents an asynchronous on-policy reinforcement learning implementation that maintains the same training stability as synchronous RL methods. [3]
  • Key terms: asynchronous on-policy reinforcement learning, synchronous RL methods, micro-batches, prompt-level asynchronous parallelism, group-mask attention mechanism. [3]
  • The proposed framework demonstrates near-linear scaling in training speed as the number of devices increases. [3]
  • The decoupling of training and inference allows both to be scaled independently and flexibly, achieving optimal balance and maximizing the speed-to-device ratio. [3]
  • The proposed framework achieves a three- to five-fold improvement in end-to-end performance over current mainstream RL training frameworks on the NPU platform. [2]
Abstract
Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we are returning to the strategy of separating inference and training deployment, and by introducing improvements in the data loader, we transform the conventional synchronous architecture into a periodically asynchronous framework, which allows for demand-driven, independent, and elastic scaling of each component, while the accuracy of the algorithm remains completely equivalent to the synchronization method, with both belonging to the on-policy strategy. It is worth emphasizing that we apply a unified tri-model architecture in the training phase, and we also proposed a shared-prompt attention mask to reduce repetitive computation. In practice, these works have achieved at least a threefold overall performance improvement in RL training on NPU platforms, indicating its potential for widespread application.
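One way to read the shared-prompt attention mask mentioned above is as a group mask in which several rollouts packed into one sequence all attend to a single shared copy of the prompt but never to each other, so the prompt is encoded only once. The sketch below is an illustrative interpretation, not the framework's actual kernel.

```python
# Sketch of a "shared-prompt" (group) attention mask: responses packed after one prompt
# may attend to the prompt and causally to themselves, but not to sibling responses.
import torch

def shared_prompt_mask(prompt_len: int, response_lens: list[int]) -> torch.Tensor:
    """Boolean mask of shape (T, T); True means position i may attend to position j."""
    total = prompt_len + sum(response_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Prompt tokens attend causally within the prompt.
    mask[:prompt_len, :prompt_len] = torch.tril(
        torch.ones(prompt_len, prompt_len, dtype=torch.bool))

    start = prompt_len
    for r in response_lens:
        end = start + r
        mask[start:end, :prompt_len] = True          # every response sees the full prompt
        mask[start:end, start:end] = torch.tril(      # causal within its own response
            torch.ones(r, r, dtype=torch.bool))
        start = end                                   # responses never see each other
    return mask
```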
Why we think this paper is great for you:
This work focuses on accelerating on-policy reinforcement learning, a key area for improving the efficiency of your deep learning applications. It offers a method to tackle the challenge of training efficiency in RL.
Abstract
Reinforcement learning (RL) commonly relies on scalar rewards with limited ability to express temporal, conditional, or safety-critical goals, and can lead to reward hacking. Temporal logic expressible via the more general class of $ω$-regular objectives addresses this by precisely specifying rich behavioural properties. Even still, measuring performance by a single scalar (be it reward or satisfaction probability) masks safety-performance trade-offs that arise in settings with a tolerable level of risk. We address both limitations simultaneously by combining $ω$-regular objectives with explicit constraints, allowing safety requirements and optimisation targets to be treated separately. We develop a model-based RL algorithm based on linear programming, which in the limit produces a policy maximising the probability of satisfying an $ω$-regular objective while also adhering to $ω$-regular constraints within specified thresholds. Furthermore, we establish a translation to constrained limit-average problems with optimality-preserving guarantees.
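For intuition, here is a simplified occupancy-measure linear program for a constrained limit-average MDP, the kind of problem the abstract's translation targets, solved with scipy.optimize.linprog. It is a generic sketch under the assumption that the ω-regular objective and constraint have already been reduced to limit-average rewards r and c; it is not the paper's algorithm.

```python
# Generic occupancy-measure LP for a constrained limit-average MDP (illustrative only).
import numpy as np
from scipy.optimize import linprog

def solve_constrained_avg_mdp(P, r, c, c_min):
    """P: (S, A, S) transition probabilities; r, c: (S, A) objective and constraint rewards;
    c_min: required long-run average value of the constraint reward."""
    S, A, _ = P.shape
    n = S * A                                    # one variable x[s, a] per state-action pair

    # Stationarity: for every state, outflow equals inflow under P; occupancies sum to 1.
    A_eq = np.zeros((S + 1, n))
    for s in range(S):
        for a in range(A):
            A_eq[s, s * A + a] += 1.0            # outflow from s via action a
            A_eq[:S, s * A + a] -= P[s, a]       # inflow into each successor state
    A_eq[S, :] = 1.0
    b_eq = np.zeros(S + 1)
    b_eq[S] = 1.0

    # Constraint: long-run average of c is at least c_min, i.e. -c.x <= -c_min.
    res = linprog(-r.reshape(n),                 # linprog minimizes, so negate to maximize r.x
                  A_ub=-c.reshape(n)[None, :], b_ub=np.array([-c_min]),
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    x = res.x.reshape(S, A)
    policy = x / np.maximum(x.sum(axis=1, keepdims=True), 1e-12)  # pi(a|s) proportional to x[s, a]
    return policy, -res.fun
```

The returned stationary policy maximizes the average objective reward while keeping the average constraint reward above the threshold, mirroring the safety-performance trade-off the abstract describes.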
Why we think this paper is great for you:
This paper expands on the fundamental aspects of reinforcement learning by introducing $ω$-regular objectives and constraints. It provides a more expressive way to define complex, safety-critical goals for RL agents.