Image Recognition

Classification Filtering

Whoop, Boston, MA, USA

Abstract
We consider a streaming signal in which each sample is linked to a latent class. We assume that multiple classifiers are available, each providing class probabilities with varying degrees of accuracy. These classifiers are employed following a straightforward and fixed policy. In this setting, we consider the problem of fusing the output of the classifiers while incorporating the temporal aspect to improve classification accuracy. We propose a state-space model and develop a filter tailored for realtime execution. We demonstrate the effectiveness of the proposed filter in an activity classification application based on inertial measurement unit (IMU) data from a wearable device.

AI Insights

The filter models class probabilities with a Dirichlet prior, enabling principled Bayesian updates on streaming data.
Weak and strong classifiers are weighted separately, yielding a 3–5 % accuracy boost over uniform fusion.
A simple running‑average smoother further improves performance, demonstrating the value of temporal consistency.
The smoothing scheme can be applied without distinguishing classifier strength, simplifying deployment.
The approach generalizes to other domains such as image denoising or NLP, as suggested by the authors.
Key references include “Bayesian Filtering and Smoothing” by S. Sarkka and “Graphical Models, Exponential Families” by Wainwright & Jordan.
Core concepts: Bayesian inference updates beliefs; the Dirichlet distribution models categorical probability vectors.

👍 👎 ♥ Save

Image Realness Assessment and Localization with Multimodal Features

Indian Institute of Techn

Abstract
A reliable method of quantifying the perceptual realness of AI-generated images and identifying visually inconsistent regions is crucial for practical use of AI-generated images and for improving photorealism of generative AI via realness feedback during training. This paper introduces a framework that accomplishes both overall objective realness assessment and local inconsistency identification of AI-generated images using textual descriptions of visual inconsistencies generated by vision-language models trained on large datasets that serve as reliable substitutes for human annotations. Our results demonstrate that the proposed multimodal approach improves objective realness prediction performance and produces dense realness maps that effectively distinguish between realistic and unrealistic spatial regions.

AI Insights

REALM fuses vision‑language embeddings to produce pixel‑wise realness scores, beating unimodal baselines by 12%. Multimodal means combining image and text data to assess realism.
Dense realness maps expose subtle artifacts like texture mismatches and color bleeding invisible to humans.
VLM‑generated captions sometimes mislabel fine‑grained distortions in human faces.
Future work fine‑tunes open‑source VLMs with human‑labeled realness descriptions for cost‑effective scalability.
RAISE provides a comparative realness metric framework that complements REALM’s multimodal approach.
AGIQA‑3K offers a diverse AI‑generated image set to benchmark REALM’s localization accuracy.
Embedding REALM’s explainable maps into generative training loops enables realness‑guided loss functions.

multimodal models

👍 👎 ♥ Save

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Abstract
Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.

👍 👎 ♥ Save

MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

Peng XuShengwu XiongJia

Abstract
This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark. We hope it better allows researchers to follow the state-of-the-art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year's MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of MLLMs. Our organizing team released two tailored datasets Lens and AdsQA as test sets, which support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines that include both generalist MLLMs and task-specific models, and opened up three competition tracks, i.e., Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). Finally, 76 teams from the renowned academic and industrial institutions have registered and 40+ valid submissions (out of 1200+) have been included in our ranking lists. Our datasets, code sets (40+ baselines and 15+ participants' methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page https://github.com/mars2workshop/, where our updates and announcements of upcoming events will be continuously provided.

AI Insights

The challenge’s literature review spotlights CNNs, RNNs, and GANs as the backbone of winning multimodal video methods.
Teams such as ActiveAlphaAgentTeam and adaboostTeam showcased hybrid architectures that fuse visual, audio, and textual cues.
Overfitting and interpretability remain the biggest hurdles, prompting researchers to explore explainable deep‑learning tricks.
The official resources include a comprehensive review book and five survey papers that map the field’s rapid evolution.
Participants leveraged the Lens dataset’s 12 daily scenarios and AdsQA’s ad‑video clips to push spatial reasoning limits.
The competition’s 40+ baselines and 15+ participant methods illustrate the vibrant cross‑disciplinary collaboration in multimodal AI.
Curious readers can dive deeper into “Multi‑Modal Video Analysis and Synthesis: A Comprehensive Review” for a thorough primer.

convolution

👍 👎 ♥ Save

Region-Aware Deformable Convolutions

Abstract
We introduce Region-Aware Deformable Convolution (RAD-Conv), a new convolutional operator that enhances neural networks' ability to adapt to complex image structures. Unlike traditional deformable convolutions, which are limited to fixed quadrilateral sampling areas, RAD-Conv uses four boundary offsets per kernel element to create flexible, rectangular regions that dynamically adjust their size and shape to match image content. This approach allows precise control over the receptive field's width and height, enabling the capture of both local details and long-range dependencies, even with small 1x1 kernels. By decoupling the receptive field's shape from the kernel's structure, RAD-Conv combines the adaptability of attention mechanisms with the efficiency of standard convolutions. This innovative design offers a practical solution for building more expressive and efficient vision models, bridging the gap between rigid convolutional architectures and computationally costly attention-based methods.

Image Processing

👍 👎 ♥ Save

Task-Aware Image Signal Processor for Advanced Visual Perception

University of Electronic

Abstract
In recent years, there has been a growing trend in computer vision towards exploiting RAW sensor data, which preserves richer information compared to conventional low-bit RGB images. Early studies mainly focused on enhancing visual quality, while more recent efforts aim to leverage the abundant information in RAW data to improve the performance of visual perception tasks such as object detection and segmentation. However, existing approaches still face two key limitations: large-scale ISP networks impose heavy computational overhead, while methods based on tuning traditional ISP pipelines are restricted by limited representational capacity.To address these issues, we propose Task-Aware Image Signal Processing (TA-ISP), a compact RAW-to-RGB framework that produces task-oriented representations for pretrained vision models. Instead of heavy dense convolutional pipelines, TA-ISP predicts a small set of lightweight, multi-scale modulation operators that act at global, regional, and pixel scales to reshape image statistics across different spatial extents. This factorized control significantly expands the range of spatially varying transforms that can be represented while keeping memory usage, computation, and latency tightly constrained. Evaluated on several RAW-domain detection and segmentation benchmarks under both daytime and nighttime conditions, TA-ISP consistently improves downstream accuracy while markedly reducing parameter count and inference time, making it well suited for deployment on resource-constrained devices.

AI Insights

TA‑ISP learns a tiny set of multi‑scale modulation operators that reshape image statistics at global, regional, and pixel levels, enabling rich spatial transforms without heavy convolutions.
Mask layers act as task‑specific attention maps, selectively amplifying or suppressing features to match the downstream detector’s receptive field.
Ablation shows removing regional modulation drops nighttime detection mAP by 3.2%, proving its role in low‑light robustness.
Compared to RAW‑Adapter and InvISP, TA‑ISP is 1.8× faster with only 0.4 M extra parameters, ideal for edge devices.
Open‑source code and visualizations illustrate how modulation operators adjust color constancy and contrast across scenes.
The ISP can be tuned on‑device for autonomous driving or surveillance, prioritizing pedestrian or license‑plate detection.
For background, explore RAW‑Adapter’s mapping and InvISP’s inverse‑learning framework, both key to TA‑ISP’s design.

fusion models

👍 👎 ♥ Save

FusionMAE: large-scale pretrained model to optimize and simplify diagnostic and control of fusion plasma

Southwestern Institute of

Abstract
In magnetically confined fusion device, the complex, multiscale, and nonlinear dynamics of plasmas necessitate the integration of extensive diagnostic systems to effectively monitor and control plasma behaviour. The complexity and uncertainty arising from these extensive systems and their tangled interrelations has long posed a significant obstacle to the acceleration of fusion energy development. In this work, a large-scale model, fusion masked auto-encoder (FusionMAE) is pre-trained to compress the information from 88 diagnostic signals into a concrete embedding, to provide a unified interface between diagnostic systems and control actuators. Two mechanisms are proposed to ensure a meaningful embedding: compression-reduction and missing-signal reconstruction. Upon completion of pre-training, the model acquires the capability for 'virtual backup diagnosis', enabling the inference of missing diagnostic data with 96.7% reliability. Furthermore, the model demonstrates three emergent capabilities: automatic data analysis, universal control-diagnosis interface, and enhancement of control performance on multiple tasks. This work pioneers large-scale AI model integration in fusion energy, demonstrating how pre-trained embeddings can simplify the system interface, reducing necessary diagnostic systems and optimize operation performance for future fusion reactors.

AI Insights

Funding comes from China’s National MCF R&D program (2024YFE03240100) and the National Natural Science Foundation (U21A20440).
The 88‑signal dataset was assembled by the Southwestern Institute of Physics, whose team is thanked.
References mix ML classics like BERT with recent preprints on disruption prediction and equilibrium reconstruction for tokamaks.
Citations include ICML and CVPR, highlighting cross‑disciplinary influence.
Recommended texts: “Deep Learning” and “Natural Language Processing (almost) from Scratch.”
Authors thank the Southwestern Institute of Physics for dataset and algorithm contributions.
The model serves as a universal control‑diagnosis interface, suggesting fewer diagnostics for future reactors.

👍 👎 ♥ Save

Real-Time Thermal State Estimation and Forecasting in Laser Powder Bed Fusion

Abstract
Laser Powder Bed Fusion (L-PBF) is a widely adopted additive manufacturing process for fabricating complex metallic parts layer by layer. Effective thermal management is essential to ensure part quality and structural integrity, as thermal gradients and residual stresses can lead to defects such as warping and cracking. However, existing experimental or computational techniques lack the ability to forecast future temperature distributions in real time, an essential capability for proactive process control. This paper presents a real-time thermal state forecasting framework for L-PBF, based on a physics-informed reduced-order thermal model integrated with a Kalman filtering scheme. The proposed approach efficiently captures inter-layer heat transfer dynamics and enables accurate tracking and forecasting of spatial and temporal temperature evolution. Validation across multiple part geometries using measured data demonstrates that the method reliably estimates and forecasts peak temperatures and cooling trends. By enabling predictive thermal control, this framework offers a practical and computationally efficient solution for thermal management in L-PBF, paving the way toward closed-loop control in L-PBF.

Help us improve your experience!