Abstract
Despite significant progress, existing research on Multimodal Large Language
Models (MLLMs) mainly focuses on general visual understanding, overlooking the
ability to integrate textual context associated with objects for a more
context-aware multimodal understanding -- an ability we refer to as
Region-level Context-aware Multimodal Understanding (RCMU). To address this
limitation, we first formulate the RCMU task, which requires models to respond
to user instructions by integrating both image content and the textual
information associated with regions or objects. To equip MLLMs with RCMU capabilities, we propose
Region-level Context-aware Visual Instruction Tuning (RCVIT), which
incorporates object information into the model input and enables the model to
utilize bounding box coordinates to effectively associate objects' visual
content with their textual information. To address the lack of datasets, we
introduce the RCMU dataset, a large-scale visual instruction tuning dataset
that covers multiple RCMU tasks. We also propose RC&P-Bench, a comprehensive
benchmark that can evaluate the performance of MLLMs in RCMU and multimodal
personalized understanding tasks. Additionally, we propose a reference-free
evaluation metric for comprehensive, fine-grained evaluation of
region-level context-aware image descriptions. By performing RCVIT on Qwen2-VL
models with the RCMU dataset, we develop the RC-Qwen2-VL models. Experimental
results indicate that RC-Qwen2-VL models not only achieve outstanding
performance on multiple RCMU tasks but also demonstrate successful applications
in multimodal RAG and personalized conversation. Our data, models, and benchmark
are available at https://github.com/hongliang-wei/RC-MLLM.
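To make the RCVIT input format more concrete, the sketch below shows one way per-object textual context could be paired with bounding-box coordinates in a prompt. It is a minimal illustration only: the <obj>/<box> tag syntax and the record fields are assumptions, not the exact format used to train RC-Qwen2-VL.

```python
# Hypothetical sketch of injecting region-level textual context into an MLLM
# prompt for RCMU-style instruction tuning (Python 3.9+). Tag names and fields
# are illustrative assumptions, not the actual RC-Qwen2-VL training format.

from dataclasses import dataclass


@dataclass
class RegionContext:
    name: str                        # e.g., a person's or product's name
    bbox: tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates
    info: str                        # free-form textual context for the region


def build_rcmu_prompt(regions: list[RegionContext], instruction: str) -> str:
    """Interleave per-object textual context with bounding-box references
    so the model can associate each description with an image region."""
    lines = []
    for r in regions:
        x1, y1, x2, y2 = r.bbox
        lines.append(f"<obj>{r.name}</obj><box>({x1},{y1}),({x2},{y2})</box>: {r.info}")
    context_block = "\n".join(lines)
    return f"Object information:\n{context_block}\n\nInstruction: {instruction}"


if __name__ == "__main__":
    regions = [
        RegionContext("Alice", (34, 50, 210, 480), "team lead, joined in 2020"),
        RegionContext("Bob", (250, 60, 420, 470), "visiting researcher"),
    ]
    print(build_rcmu_prompt(regions, "Describe the people in the image."))
```

Running the script prints a prompt in which each object's textual information is tied to its coordinates, which is the kind of visual-textual association the RCVIT formulation relies on.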
Abstract
As AI-generated content becomes widespread, so does the risk of
misinformation. While prior research has primarily focused on identifying
whether content is authentic, much less is known about how such content
influences human perception and behavior. In domains like trading or the stock
market, predicting how people react (e.g., whether a news post will go viral)
can be more critical than verifying its factual accuracy. To address this, we
take a human-centered approach and introduce the MhAIM Dataset, which contains
154,552 online posts (111,153 of them AI-generated), enabling large-scale
analysis of how people respond to AI-generated content. Our human study reveals
that people are better at identifying AI content when posts include both text
and visuals, particularly when inconsistencies exist between the two. We
propose three new metrics (trustworthiness, impact, and openness) to quantify
how users judge and engage with online content. We present T-Lens, an LLM-based
agent system designed to answer user queries by incorporating predicted human
responses to multimodal information. At its core is HR-MCP (Human Response
Model Context Protocol), built on the standardized Model Context Protocol
(MCP), enabling seamless integration with any LLM. This integration allows
T-Lens to better align with human reactions, enhancing both interpretability
and interaction capabilities. Our work provides empirical insights and
practical tools to equip LLMs with human-awareness capabilities. By
highlighting the complex interplay among AI, human cognition, and information
reception, our findings suggest actionable strategies for mitigating the risks
of AI-driven misinformation.
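For concreteness, the sketch below shows how an HR-MCP-style human-response predictor could be exposed as an MCP tool, assuming the official `mcp` Python SDK and its FastMCP helper. The tool name, inputs, and the placeholder scoring heuristic are illustrative assumptions and do not reflect the actual HR-MCP implementation.

```python
# Minimal sketch of exposing a "human response" predictor as an MCP tool so
# any MCP-capable LLM client can call it. Requires the `mcp` package
# (pip install mcp) -- an assumption about the environment; the scoring logic
# below is a toy placeholder, not the learned model described in the paper.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("human-response")


@mcp.tool()
def predict_human_response(text: str, has_image: bool = False) -> dict:
    """Return rough trustworthiness / impact / openness scores in [0, 1] for a post."""
    # Placeholder heuristic standing in for a learned human-response model.
    length_factor = min(len(text) / 500.0, 1.0)
    return {
        "trustworthiness": round(0.5 + (0.1 if has_image else -0.1), 2),
        "impact": round(length_factor, 2),
        "openness": 0.5,
    }


if __name__ == "__main__":
    # Serve the tool over stdio so an MCP client can launch and query it.
    mcp.run(transport="stdio")
```

An MCP-capable client (for example, an agent in the spirit of T-Lens) could launch this server over stdio and call predict_human_response before composing its answer, aligning its output with the predicted human reaction.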