Nanyang Technological Unv
Abstract
With the recent emergence of revolutionary autonomous agentic systems,
research community is witnessing a significant shift from traditional static,
passive, and domain-specific AI agents toward more dynamic, proactive, and
generalizable agentic AI. Motivated by the growing interest in agentic AI and
its potential trajectory toward AGI, we present a comprehensive survey on
Agentic Multimodal Large Language Models (Agentic MLLMs). In this survey, we
explore the emerging paradigm of agentic MLLMs, delineating their conceptual
foundations and distinguishing characteristics from conventional MLLM-based
agents. We establish a conceptual framework that organizes agentic MLLMs along
three fundamental dimensions: (i) Agentic internal intelligence functions as
the system's commander, enabling accurate long-horizon planning through
reasoning, reflection, and memory; (ii) Agentic external tool invocation,
whereby models proactively use various external tools to extend their
problem-solving capabilities beyond their intrinsic knowledge; and (iii)
Agentic environment interaction further situates models within virtual or
physical environments, allowing them to take actions, adapt strategies, and
sustain goal-directed behavior in dynamic real-world scenarios. To further
accelerate research in this area for the community, we compile open-source
training frameworks, training and evaluation datasets for developing agentic
MLLMs. Finally, we review the downstream applications of agentic MLLMs and
outline future research directions for this rapidly evolving field. To
continuously track developments in this rapidly evolving field, we will also
actively update a public repository at
https://github.com/HJYao00/Awesome-Agentic-MLLMs.
AI Insights - Reinforcement learning with many alternatives is highlighted as a key training method for agentic MLLMs.
- Formal verification for autonomous systems is a vital safety guarantee for agentic MLLMs.
- A curated safety evaluation framework list benchmarks multimodal LLM robustness.
- The book Human Compatible is cited to discuss ethical control of agentic AI.
- Microsoft and Google deployments show agentic MLLMs in medical research and autonomous driving.
- Key papers—Multimodal LLMs for Medical Research, Autonomous Driving with MLLMs, and Safety Evaluation Frameworks for MLLMs—are highlighted.
- Transparency and accountability gaps are flagged as top limitations, urging explainable agentic MLLMs.
Abstract
Multimodal large language models (MLLMs) integrate image features from visual
encoders with LLMs, demonstrating advanced comprehension capabilities. However,
mainstream MLLMs are solely supervised by the next-token prediction of textual
tokens, neglecting critical vision-centric information essential for analytical
abilities. To track this dilemma, we introduce VaCo, which optimizes MLLM
representations through Vision-Centric activation and Coordination from
multiple vision foundation models (VFMs). VaCo introduces visual discriminative
alignment to integrate task-aware perceptual features extracted from VFMs,
thereby unifying the optimization of both textual and visual outputs in MLLMs.
Specifically, we incorporate the learnable Modular Task Queries (MTQs) and
Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals
under the supervision of diverse VFMs. To coordinate representation conflicts
across VFMs, the crafted Token Gateway Mask (TGM) restricts the information
flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo
significantly improves the performance of different MLLMs on various
benchmarks, showcasing its superior capabilities in visual comprehension.