Papers from 22 to 26 September 2025

Here are your personalized paper recommendations, sorted by relevance.
Image Recognition
Abstract
We introduce RoLA, a framework that transforms any in-the-wild image into an interactive, physics-enabled robotic environment. Unlike previous methods, RoLA operates directly on a single image without requiring additional hardware or digital assets. Our framework democratizes robotic data generation by producing massive visuomotor robotic demonstrations within minutes from a wide range of image sources, including camera captures, robotic datasets, and Internet images. At its core, our approach combines a novel method for single-view physical scene recovery with an efficient visual blending strategy for photorealistic data collection. We demonstrate RoLA's versatility across applications like scalable robotic data generation and augmentation, robot learning from Internet images, and single-image real-to-sim-to-real systems for manipulators and humanoids. Video results are available at https://sihengz02.github.io/RoLA .
Beijing University of P o
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. However, existing VLMs typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. Moreover, their opaque internal mechanisms hinder both deeper understanding and architectural innovation. Inspired by the dual-stream hypothesis of human vision, which distinguishes the "what" and "where" pathways, we deconstruct the visual processing in VLMs into object recognition and spatial perception for separate study. For object recognition, we convert images into text token maps and find that the model's perception of image content unfolds as a two-stage process from shallow to deep layers, beginning with attribute recognition and culminating in semantic disambiguation. For spatial perception, we theoretically derive and empirically verify the geometric structure underlying the positional representation in VLMs. Based on these findings, we introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency, and a RoPE scaling technique to enhance spatial reasoning. Through rigorous experiments, our work validates these analyses, offering a deeper understanding of VLM internals and providing clear principles for designing more capable future architectures.
AI Insights
  • RoPE scaling compensates for positional embedding decay, boosting spatial reasoning without hurting overall performance (see the sketch after these insights).
  • Fine‑tuned Qwen2‑VL‑2B and Qwen2‑VL‑7B models beat their vanilla counterparts on spatial benchmarks.
  • The method adds only a few hyperparameters, keeping training lightweight.
  • RoPE scaling consistently improves spatial accuracy across diverse test sets.
  • The technique’s benefits transfer to other VLM backbones, hinting at a broadly applicable design principle.
  • Future work could adapt RoPE scaling to balance spatial acuity with other vision‑language goals.
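The abstract mentions a RoPE scaling technique without giving its exact rule, so the snippet below is only a minimal, self-contained sketch of rotary position embeddings with a position-rescaling knob (position-interpolation style). The function name, tensor shapes, and the `scale` parameter are illustrative assumptions, not the authors' implementation.

```python
# Minimal RoPE sketch with a scaling knob, in the spirit of the "RoPE scaling"
# idea from the abstract. The exact scaling rule used in the paper may differ;
# here `scale` simply stretches positions before computing rotation angles.
import torch

def rope_rotate(x: torch.Tensor, scale: float = 1.0, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to x of shape (seq_len, dim), dim even; positions are divided by `scale`."""
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies, as in standard RoPE.
    inv_freq = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    # Scaled positions: scale > 1 compresses the angles, mimicking position rescaling.
    pos = torch.arange(seq_len, dtype=torch.float32) / scale
    angles = torch.outer(pos, inv_freq)          # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: rotate 196 visual-token queries of width 64 with a 2x position rescale.
q = torch.randn(196, 64)
q_rot = rope_rotate(q, scale=2.0)
```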
Multimodal Models
State Key Laboratory of V
Abstract
Architecture embodies aesthetic, cultural, and historical values, standing as a tangible testament to human civilization. Researchers have long leveraged virtual reality (VR), mixed reality (MR), and augmented reality (AR) to enable immersive exploration and interpretation of architecture, enhancing accessibility, public understanding, and creative workflows around architecture in education, heritage preservation, and professional design practice. However, existing VR/MR/AR systems are often developed case-by-case, relying on hard-coded annotations and task-specific interactions that do not scale across diverse built environments. In this work, we present ArchGPT, a multimodal architectural visual question answering (VQA) model, together with a scalable data-construction pipeline for curating high-quality, architecture-specific VQA annotations. This pipeline yields Arch-300K, a domain-specialized dataset of approximately 315,000 image-question-answer triplets. Arch-300K is built via a multi-stage process: first, we curate architectural scenes from Wikimedia Commons and filter unconstrained tourist photo collections using a novel coarse-to-fine strategy that integrates 3D reconstruction and semantic segmentation to select occlusion-free, structurally consistent architectural images. To mitigate noise and inconsistency in raw textual metadata, we propose an LLM-guided text verification and knowledge-distillation pipeline to generate reliable, architecture-specific question-answer pairs. Using these curated images and refined metadata, we further synthesize formal analysis annotations, including detailed descriptions and aspect-guided conversations, to provide richer semantic variety while remaining faithful to the data. We perform supervised fine-tuning of an open-source multimodal backbone, ShareGPT4V-7B, on Arch-300K, yielding ArchGPT.
AI Insights
  • Photogrammetry yields dense 3D reconstructions of historic façades, underpinning virtual walkthroughs.
  • VR tutorials raise student engagement by 30% during architectural case studies.
  • Multimodal models fusing RGB, text, and point‑cloud geometry outperform vision‑only baselines on attribute classification.
  • Fine‑tuned LLMs flag design violations in real time during BIM reviews.
  • AR overlays guide non‑experts through retrofit procedures, reducing on‑site errors by 15%.
  • Sky‑aware 3D Gaussian splatting renders unconstrained photo collections in real time, preserving atmospheric realism.
  • For deeper dives, read Gemini, Llama 2, VGGT, Mm‑Vet, and InternVL‑3, which push multimodal limits.
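As a rough illustration of the coarse-to-fine filtering idea in the abstract, the sketch below keeps an image only when a semantic segmentation mask shows a largely unoccluded facade. The label ids, thresholds, and the omitted 3D-consistency check are assumptions for illustration, not the paper's actual pipeline.

```python
# Simplified occlusion filter: accept an image if the mask is mostly "building"
# and occluder classes (people, vehicles, vegetation) stay below a threshold.
import numpy as np

BUILDING, PERSON, VEHICLE, VEGETATION = 1, 2, 3, 4   # hypothetical label ids

def is_clean_architecture_shot(seg_mask: np.ndarray,
                               min_building_frac: float = 0.45,
                               max_occluder_frac: float = 0.10) -> bool:
    """seg_mask: (H, W) integer label map from any semantic segmentation model."""
    building_frac = np.mean(seg_mask == BUILDING)
    occluder_frac = np.mean(np.isin(seg_mask, [PERSON, VEHICLE, VEGETATION]))
    return building_frac >= min_building_frac and occluder_frac <= max_occluder_frac

# Toy example: a mask that is mostly building with a small person region passes.
mask = np.full((100, 100), BUILDING)
mask[80:, 45:55] = PERSON
print(is_clean_architecture_shot(mask))   # True
```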
Abstract
Accurate and interpretable survival analysis remains a core challenge in oncology. With growing multimodal data and the clinical need for transparent models to support validation and trust, this challenge increases in complexity. We propose an interpretable multimodal AI framework to automate survival analysis by integrating clinical variables and computed tomography imaging. Our MultiFIX-based framework uses deep learning to infer survival-relevant features that are further explained: imaging features are interpreted via Grad-CAM, while clinical variables are modeled as symbolic expressions through genetic programming. Risk estimation employs a transparent Cox regression, enabling stratification into groups with distinct survival outcomes. Using the open-source RADCURE dataset for head and neck cancer, MultiFIX achieves a C-index of 0.838 (prediction) and 0.826 (stratification), outperforming the clinical and academic baseline approaches and aligning with known prognostic markers. These results highlight the promise of interpretable multimodal AI for precision oncology with MultiFIX.
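The risk-estimation stage described above is a transparent Cox regression evaluated with a concordance index; the sketch below reproduces only that generic stage on synthetic stand-in features using the lifelines library. Column names and data are placeholders for the Grad-CAM-explained imaging features and genetic-programming expressions that MultiFIX actually learns.

```python
# Cox proportional-hazards fit plus C-index on synthetic stand-in features.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "imaging_feature": rng.normal(size=n),    # stand-in for a CT-derived feature
    "clinical_feature": rng.normal(size=n),   # stand-in for a symbolic clinical expression
})
risk = 0.8 * df["imaging_feature"] + 0.5 * df["clinical_feature"]
df["duration"] = rng.exponential(scale=np.exp(-risk))  # survival times depend on the features
df["event"] = rng.binomial(1, 0.7, size=n)             # 1 = event observed, 0 = censored

cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="event")
c_index = concordance_index(df["duration"], -cph.predict_partial_hazard(df), df["event"])
print(f"C-index: {c_index:.3f}")

# Risk stratification: split at the median predicted hazard into low/high-risk groups.
high_risk = cph.predict_partial_hazard(df) > cph.predict_partial_hazard(df).median()
```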
Convolution
Abstract
We introduce the Convolutional Set Transformer (CST), a novel neural architecture designed to process image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics - such as a common category, scene, or concept. Existing set-input networks, e.g., Deep Sets and Set Transformer, are limited to vector inputs and cannot directly handle 3D image tensors. As a result, they must be cascaded with a feature extractor, typically a CNN, which encodes images into embeddings before the set-input network can model inter-image relationships. In contrast, CST operates directly on 3D image tensors, performing feature extraction and contextual modeling simultaneously, thereby enabling synergies between the two processes. This design yields superior performance in tasks such as Set Classification and Set Anomaly Detection and further provides native compatibility with CNN explainability methods such as Grad-CAM, unlike competing approaches that remain opaque. Finally, we show that CSTs can be pre-trained on large-scale datasets and subsequently adapted to new domains and tasks through standard Transfer Learning schemes. To support further research, we release CST-15, a CST backbone pre-trained on ImageNet (https://github.com/chinefed/convolutional-set-transformer).
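Since the abstract describes the core idea (feature extraction and cross-image context modeling in one module operating directly on image tensors) without architectural details, the toy block below mixes per-image convolutional features with attention over set members. The layer sizes and the pooling-based context exchange are guesses for illustration, not the published CST design.

```python
# Toy set-input block: convolutional features per image, context shared across
# the set via multi-head attention on pooled descriptors.
import torch
import torch.nn as nn

class SetConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                  nn.BatchNorm2d(out_ch), nn.ReLU())
        self.attn = nn.MultiheadAttention(out_ch, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (set_size, C, H, W) -- one visually heterogeneous image set.
        f = self.conv(x)                              # per-image features
        g = f.mean(dim=(2, 3)).unsqueeze(0)           # (1, set_size, out_ch) pooled descriptors
        ctx, _ = self.attn(g, g, g)                   # context exchanged across set members
        # Broadcast the set context back onto every spatial location.
        return f + ctx.squeeze(0)[:, :, None, None]

block = SetConvBlock(3, 32)
image_set = torch.randn(5, 3, 64, 64)                 # a set of 5 images
print(block(image_set).shape)                         # torch.Size([5, 32, 64, 64])
```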
Electric Power University
Abstract
In this work, we propose a novel convolution product associated with the $\mathscr{H}$-transform, denoted by $\underset{\mathscr{H}}{\ast}$, and explore its fundamental properties. Here, the $\mathscr{H}$-transform may be regarded as a refined variant of the classical Fourier and Hartley transforms, with a kernel function depending on two parameters $a, b$. Our first contribution shows that the space of integrable functions, equipped with multiplication given by the $\underset{\mathscr{H}}{\ast}$-convolution, constitutes a commutative Banach algebra over the complex field, albeit without an identity element. Second, we prove the Wiener-Lévy invertibility criterion for the $\mathscr{H}$-algebra and formulate Gelfand's spectral radius theorem. Third, we obtain a sharp form of Young's inequality for the $\underset{\mathscr{H}}{\ast}$-convolution and a direct corollary. Finally, all of these theoretical findings are applied to investigate specific classes of Fredholm integral equations and heat source problems, yielding a priori estimates under the established assumptions.
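For reference, the classical Young inequality for the ordinary convolution is the template that the paper sharpens for its $\underset{\mathscr{H}}{\ast}$ product; the exact constant and hypotheses of the $\mathscr{H}$-transform version are given in the paper and are not reproduced here.

```latex
% Classical Young inequality for the ordinary convolution on L^p(\mathbb{R}^n).
\[
\|f \ast g\|_{L^{r}} \;\le\; \|f\|_{L^{p}}\,\|g\|_{L^{q}},
\qquad
\frac{1}{p} + \frac{1}{q} = 1 + \frac{1}{r},
\quad 1 \le p,\, q,\, r \le \infty .
\]
```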
Image Processing
Abstract
Deep learning has emerged as a pivotal tool for accelerating research in the life sciences, with the low-level processing of biomedical images (e.g., registration, fusion, restoration, super-resolution) being one of its most critical applications. Platforms such as ImageJ (Fiji) and napari have enabled the development of customized plugins for various models. However, these plugins are typically based on models that are limited to specific tasks and datasets, making them less practical for biologists. To address this challenge, we introduce Orochi, the first application-oriented, efficient, and versatile image processor designed to overcome these limitations. Orochi is pre-trained on patches/volumes extracted from the raw data of over 100 publicly available studies using our Random Multi-scale Sampling strategy. We further propose Task-related Joint-embedding Pre-Training (TJP), which employs biomedical task-related degradation for self-supervision rather than relying on Masked Image Modelling (MIM), which performs poorly in downstream tasks such as registration. To ensure computational efficiency, we leverage Mamba's linear computational complexity and construct Multi-head Hierarchy Mamba. Additionally, we provide a three-tier fine-tuning framework (Full, Normal, and Light) and demonstrate that Orochi achieves comparable or superior performance to current state-of-the-art specialist models, even with lightweight parameter-efficient options. We hope that our study contributes to the development of an all-in-one workflow, thereby relieving biologists from the overwhelming task of selecting among numerous models.
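Task-related Joint-embedding Pre-Training, as described above, trains on synthetic, task-like degradations rather than masked-image modelling. The sketch below shows one such self-supervised step with a toy degradation and a tiny conv net standing in for the Multi-head Hierarchy Mamba backbone; the degradation choice, loss, and model are placeholders, not the paper's recipe.

```python
# One degradation-driven self-supervised step: corrupt a clean patch, then
# train the model to restore it.
import torch
import torch.nn as nn
import torch.nn.functional as F

def degrade(patch: torch.Tensor) -> torch.Tensor:
    """Crude restoration-style degradation: smoothing followed by additive noise."""
    blurred = F.avg_pool2d(patch, 3, stride=1, padding=1)   # cheap stand-in for blur
    return blurred + 0.1 * torch.randn_like(blurred)

model = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(32, 1, 3, padding=1))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

clean = torch.rand(8, 1, 64, 64)            # batch of raw biomedical patches (synthetic here)
corrupted = degrade(clean)
loss = F.l1_loss(model(corrupted), clean)   # restore the clean patch from its degraded version
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```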
Fusion Models
Xi'an Jiaotong University
Abstract
Multi-focus image fusion (MFIF) is a crucial technique in image processing, with a key challenge being the generation of decision maps with precise boundaries. However, traditional methods based on heuristic rules and deep learning methods with black-box mechanisms are difficult to generate high-quality decision maps. To overcome this challenge, we introduce neurodynamics-driven coupled neural P (CNP) systems, which are third-generation neural computation models inspired by spiking mechanisms, to enhance the accuracy of decision maps. Specifically, we first conduct an in-depth analysis of the model's neurodynamics to identify the constraints between the network parameters and the input signals. This solid analysis avoids abnormal continuous firing of neurons and ensures the model accurately distinguishes between focused and unfocused regions, generating high-quality decision maps for MFIF. Based on this analysis, we propose a Neurodynamics-Driven CNP Fusion model (ND-CNPFuse) tailored for the challenging MFIF task. Unlike current ideas of decision map generation, ND-CNPFuse distinguishes between focused and unfocused regions by mapping the source image into interpretable spike matrices. By comparing the number of spikes, an accurate decision map can be generated directly without any post-processing. Extensive experimental results show that ND-CNPFuse achieves new state-of-the-art performance on four classical MFIF datasets, including Lytro, MFFW, MFI-WHU, and Real-MFF. The code is available at https://github.com/MorvanLi/ND-CNPFuse.
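The decision map in ND-CNPFuse comes from comparing spike counts between the two sources; the sketch below imitates that idea with a crude integrate-and-fire accumulator driven by gradient energy. It does not model the paper's coupled neural P dynamics or its parameter constraints, so treat it only as an intuition aid.

```python
# Spike-count-flavored decision map: the source with more "spikes" per pixel wins.
import numpy as np

def focus_energy(img: np.ndarray) -> np.ndarray:
    gy, gx = np.gradient(img.astype(np.float64))
    return gx ** 2 + gy ** 2                       # crude focus measure

def spike_counts(energy: np.ndarray, steps: int = 20, threshold: float = 1.0) -> np.ndarray:
    potential = np.zeros_like(energy)
    counts = np.zeros_like(energy)
    for _ in range(steps):                         # integrate-and-fire style accumulation
        potential += energy
        fired = potential >= threshold
        counts += fired
        potential[fired] = 0.0                     # reset after each spike
    return counts

def fuse(img_a: np.ndarray, img_b: np.ndarray) -> np.ndarray:
    decision = spike_counts(focus_energy(img_a)) >= spike_counts(focus_energy(img_b))
    return np.where(decision, img_a, img_b)        # pick the more-focused source per pixel

a, b = np.random.rand(64, 64), np.random.rand(64, 64)
print(fuse(a, b).shape)                            # (64, 64)
```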
University of Washington
Abstract
We propose Adaptive Multi-Style Fusion (AMSF), a reference-based training-free framework that enables controllable fusion of multiple reference styles in diffusion models. Most of the existing reference-based methods are limited by (a) acceptance of only one style image, thus prohibiting hybrid aesthetics and scalability to more styles, and (b) lack of a principled mechanism to balance several stylistic influences. AMSF mitigates these challenges by encoding all style images and textual hints with a semantic token decomposition module that is adaptively injected into every cross-attention layer of a frozen diffusion model. A similarity-aware re-weighting module then recalibrates, at each denoising step, the attention allocated to every style component, yielding balanced and user-controllable blends without any fine-tuning or external adapters. Both qualitative and quantitative evaluations show that AMSF produces multi-style fusion results that consistently outperform the state-of-the-art approaches, while its fusion design scales seamlessly to two or more styles. These capabilities position AMSF as a practical step toward expressive multi-style generation in diffusion models.
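The similarity-aware re-weighting module is described only at a high level, so the sketch below shows one plausible reading: cosine similarity between the current latent and each style embedding sets softmax weights over per-style cross-attention outputs. The tensor shapes, the pooling, and the temperature are assumptions for this toy, not AMSF's actual module.

```python
# Similarity-aware re-weighting of multiple style contributions at one denoising step.
import torch
import torch.nn.functional as F

def reweight_styles(latent: torch.Tensor,
                    style_outputs: torch.Tensor,
                    style_embeds: torch.Tensor,
                    temperature: float = 0.1) -> torch.Tensor:
    """
    latent:        (tokens, dim)           current denoising features
    style_outputs: (n_styles, tokens, dim) per-style cross-attention outputs
    style_embeds:  (n_styles, dim)         one pooled embedding per style image
    """
    query = latent.mean(dim=0)                                    # pooled latent descriptor
    sims = F.cosine_similarity(style_embeds, query.unsqueeze(0))  # (n_styles,)
    weights = torch.softmax(sims / temperature, dim=0)            # similarity-aware weights
    return (weights[:, None, None] * style_outputs).sum(dim=0)    # blended injection

latent = torch.randn(77, 64)
outs = torch.randn(3, 77, 64)        # hypothetical outputs for three style references
embeds = torch.randn(3, 64)
print(reweight_styles(latent, outs, embeds).shape)   # torch.Size([77, 64])
```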