Papers from 08 to 12 September 2025

Here are your personalized paper recommendations, sorted by relevance.
Image Recognition
Czech Technical University
Abstract
Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. The code is available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.
AI Insights
  • No single VLM dominates across both ImageNet‑1k splits, highlighting dataset‑specific strengths.
  • OpenCLIP ViT‑H‑14‑378 leads with 93.37% on the cleaner split and 84.54% on the raw split.
  • SigLIP 2’s k‑NN space shows per‑class optimal neighbors, boosting accuracy by up to 20%.
  • Fine‑grained categories with high label noise inflate perceived model weaknesses.
  • A learning‑free fusion that weights scores by per‑class precision blends language and vision, improving accuracy (sketched below).
  • Some classes favor textual prompts, others visual similarity, revealing hybrid inference opportunities.
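The fusion in the last two bullets is simple enough to sketch. Below is a minimal, learning-free version in the spirit of the abstract: each pathway's class scores are weighted by that pathway's per-class precision on a held-out reference set. Function names and the additive combination are our assumptions; the authors' actual implementation is in the linked repository.

```python
import numpy as np

def per_class_precision(scores, labels, n_classes):
    """Precision of the argmax prediction for each class on a held-out
    reference set: of all samples predicted as class c, the fraction
    whose true label is c."""
    preds = scores.argmax(axis=1)
    prec = np.zeros(n_classes)
    for c in range(n_classes):
        hit = preds == c
        prec[c] = (labels[hit] == c).mean() if hit.any() else 0.0
    return prec

def fuse_scores(text_scores, knn_scores, text_prec, knn_prec):
    """Weight each pathway's class scores by its per-class precision, so
    classes with reliable prompts lean on text and classes with reliable
    visual neighbours lean on k-NN."""
    return text_prec * text_scores + knn_prec * knn_scores
```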
Abstract
This study presents a systematic comparison between hybrid quantum-classical neural networks and purely classical models across three benchmark datasets (MNIST, CIFAR100, and STL10) to evaluate their performance, efficiency, and robustness. The hybrid models integrate parameterized quantum circuits with classical deep learning architectures, while the classical counterparts use conventional convolutional neural networks (CNNs). Experiments were conducted over 50 training epochs for each dataset, with evaluations on validation accuracy, test accuracy, training time, computational resource usage, and adversarial robustness (tested with ε = 0.1 perturbations). Key findings demonstrate that hybrid models consistently outperform classical models in final accuracy, achieving 99.38% (MNIST), 41.69% (CIFAR100), and 74.05% (STL10) validation accuracy, compared to classical benchmarks of 98.21%, 32.25%, and 63.76%, respectively. Notably, the hybrid advantage scales with dataset complexity, showing the most significant gains on CIFAR100 (+9.44%) and STL10 (+10.29%). Hybrid models also train 5–12× faster (e.g., 21.23s vs. 108.44s per epoch on MNIST) and use 6–32% fewer parameters while maintaining superior generalization to unseen test data. Adversarial robustness tests reveal that hybrid models are significantly more resilient on simpler datasets (e.g., 45.27% robust accuracy on MNIST vs. 10.80% for classical) but show comparable fragility on complex datasets like CIFAR100 (~1% robustness for both). Resource efficiency analyses indicate that hybrid models consume less memory (4–5GB vs. 5–6GB for classical) and lower CPU utilization (9.5% vs. 23.2% on average). These results suggest that hybrid quantum-classical architectures offer compelling advantages in accuracy, training efficiency, and parameter scalability, particularly for complex vision tasks.
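For readers unfamiliar with the hybrid pattern described above, here is a minimal sketch that wires a parameterized quantum circuit into a small PyTorch CNN using PennyLane. Qubit count, circuit depth, and layer sizes are illustrative assumptions, not the architecture reported in the paper.

```python
import pennylane as qml
import torch
import torch.nn as nn

n_qubits, n_layers = 4, 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def circuit(inputs, weights):
    # Encode classical features as rotation angles, then entangle.
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

class HybridNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(          # classical front-end
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(8 * 14 * 14, n_qubits),
        )
        self.q = qml.qnn.TorchLayer(circuit, {"weights": (n_layers, n_qubits, 3)})
        self.head = nn.Linear(n_qubits, n_classes)

    def forward(self, x):                       # x: (B, 1, 28, 28), e.g. MNIST
        return self.head(self.q(self.features(x)))

model = HybridNet()
print(model(torch.randn(2, 1, 28, 28)).shape)   # smoke test: (2, 10)
```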
Multimodal Models
KAIST AI, NYU, Chung-Ang University
Abstract
Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
AI Insights
  • The alignment loss ℓ is a cosine penalty that aligns the MLLM’s visual embeddings with a VFM’s, acting as lightweight distillation (a sketch follows this list).
  • Using CLIP, DINOv2, SAM, and RADIO‑v2.5 lets VIRAL inherit diverse visual priors, each imprinting distinct structure.
  • Layer‑wise PCA shows VIRAL sharpens the feature manifold, preserving fine‑grained spatial patterns across layers.
  • Aligning with SAM’s segmentation maps injects explicit object boundaries, boosting counting and localization.
  • On LLaVA‑1.5, VIRAL improves text‑to‑image generation and visual QA, especially spatial‑reasoning tasks.
  • Ablations reveal DINOv2 alignment gives best accuracy, while CLIP offers strongest generalization.
  • Open‑source VIRAL checkpoints invite alignment with newer models like SigLIPv2 or updated SAM.
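A minimal sketch of the cosine alignment penalty described in the first bullet, assuming a trainable linear projection from the LLM hidden size into the VFM feature dimension; shapes and names are ours, not the released code.

```python
import torch
import torch.nn.functional as F

def viral_alignment_loss(mllm_hidden, vfm_feats, proj):
    """Cosine-distance penalty between an MLLM's intermediate visual-token
    states and frozen VFM patch features for the same image.

    mllm_hidden: (B, N, D_llm) hidden states at the visual token positions
    vfm_feats:   (B, N, D_vfm) frozen features from e.g. DINOv2 or SAM
    proj:        nn.Linear(D_llm, D_vfm), trained jointly with the MLLM
    """
    z = F.normalize(proj(mllm_hidden), dim=-1)
    t = F.normalize(vfm_feats.detach(), dim=-1)   # the VFM stays frozen
    return 1.0 - (z * t).sum(dim=-1).mean()       # 1 - mean cosine similarity

# Training objective (sketch): total = lm_loss + lam * viral_alignment_loss(...)
```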
Abstract
Understanding human behavior traits is central to applications in human-computer interaction, computational social science, and personalized AI systems. Such understanding often requires integrating multiple modalities to capture nuanced patterns and relationships. However, existing resources rarely provide datasets that combine behavioral descriptors with complementary modalities such as facial attributes and biographical information. To address this gap, we present PersonaX, a curated collection of multimodal datasets designed to enable comprehensive analysis of public traits across modalities. PersonaX consists of (1) CelebPersona, featuring 9444 public figures from diverse occupations, and (2) AthlePersona, covering 4181 professional athletes across 7 major sports leagues. Each dataset includes behavioral trait assessments inferred by three high-performing large language models, alongside facial imagery and structured biographical features. We analyze PersonaX at two complementary levels. First, we abstract high-level trait scores from text descriptions and apply five statistical independence tests to examine their relationships with other modalities. Second, we introduce a novel causal representation learning (CRL) framework tailored to multimodal and multi-measurement data, providing theoretical identifiability guarantees. Experiments on both synthetic and real-world data demonstrate the effectiveness of our approach. By unifying structured and unstructured analysis, PersonaX establishes a foundation for studying LLM-inferred behavioral traits in conjunction with visual and biographical attributes, advancing multimodal trait analysis and causal reasoning.
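The abstract does not name the five independence tests, but a generic example conveys the idea of the first analysis level: a permutation test of a trait score against a scalar attribute, using |Spearman ρ| as the statistic.

```python
import numpy as np
from scipy.stats import spearmanr

def permutation_independence_test(x, y, n_perm=5000, seed=0):
    """H0: x and y are independent. Returns (|rho|, permutation p-value)."""
    rng = np.random.default_rng(seed)
    stat = abs(spearmanr(x, y).correlation)
    null = np.array([
        abs(spearmanr(x, rng.permutation(y)).correlation)
        for _ in range(n_perm)
    ])
    return stat, (1 + np.sum(null >= stat)) / (1 + n_perm)

# e.g. an LLM-inferred trait score against a biographical feature such as age:
# stat, p = permutation_independence_test(trait_scores, ages)
```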
Convolution
University of Cambridge
Abstract
We prove an asymptotic formula for the smoothed shifted convolution of the generalised divisor function $d_k(n)$ and the divisor function $d(n)$, with a power-saving error term independent of $k$. In particular, when $k$ is large, this is an improvement on Topacogullari (2018).
AI Insights
  • The analysis splits the shift parameter h into short, medium, and long ranges, each handled with a tailored contour integral estimate.
  • By exploiting a smoothed weight function, the author removes the dependence on k from the error term, a novelty absent in earlier works.
  • The proof blends Dirichlet series factorization with a refined application of the Voronoi summation formula for d_k(n).
  • A key technical step is bounding the shifted convolution sum via a bilinear form estimate that leverages large sieve inequalities.
  • The resulting asymptotic holds uniformly for k up to exp((log X)^{1/2}), extending the reach of previous bounds.
  • The paper outlines a conjectural refinement suggesting that the error term could be reduced to O(X^{1/2+ε}) using spectral theory of automorphic forms.
  • Readers interested in the analytic machinery may consult Iwaniec–Kowalski’s “Analytic Number Theory” for background on the techniques employed.
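To make the object of study concrete: d_k is the k-fold Dirichlet convolution of the constant function 1, so both d_k and d = d_2 can be sieved and the (unsmoothed) shifted convolution sum evaluated directly. This is a numerical illustration only, not the paper's smoothed sum or its asymptotic.

```python
import numpy as np

def dk_table(X, k):
    """d_k(n) for n <= X, via k-fold Dirichlet convolution of 1."""
    g = np.ones(X + 1, dtype=np.int64)
    g[0] = 0
    for _ in range(k - 1):
        h = np.zeros(X + 1, dtype=np.int64)
        for d in range(1, X + 1):
            h[d::d] += g[d]      # (g * 1)(n) = sum of g(d) over divisors d | n
        g = h
    return g

def shifted_convolution(X, k, h):
    """Sum of d_k(n) * d(n + h) over n <= X, where d = d_2."""
    dk, d2 = dk_table(X + h, k), dk_table(X + h, 2)
    n = np.arange(1, X + 1)
    return int(np.sum(dk[n] * d2[n + h]))

print(shifted_convolution(10**5, 3, 1))   # quick sanity check
```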
Image Processing
Abstract
In the realm of image generation, the quest for realism and customization has never been more pressing. While existing methods like concept sliders have made strides, they often falter on non-AIGC images, particularly those captured in real-world settings. To bridge this gap, we introduce Beyond Sliders, an innovative framework that integrates GANs and diffusion models to facilitate sophisticated image manipulation across diverse image categories. Building on concept sliders, our method refines the image through fine-grained textual and visual guidance in an adversarial manner, leading to a marked enhancement in image quality and realism. Extensive experimental validation confirms the robustness and versatility of Beyond Sliders across a spectrum of applications.
SRM Institute of Science and Technology
Abstract
This paper presents a comprehensive survey of computational imaging (CI) techniques and their transformative impact on computer vision (CV) applications. Conventional imaging methods often fail to deliver high-fidelity visual data in challenging conditions, such as low light, motion blur, or high dynamic range scenes, thereby limiting the performance of state-of-the-art CV systems. Computational imaging techniques, including light field imaging, high dynamic range (HDR) imaging, deblurring, high-speed imaging, and glare mitigation, address these limitations by enhancing image acquisition and reconstruction processes. This survey systematically explores the synergies between CI techniques and core CV tasks, including object detection, depth estimation, optical flow, face recognition, and keypoint detection. By analyzing the relationships between CI methods and their practical contributions to CV applications, this work highlights emerging opportunities, challenges, and future research directions. We emphasize the potential for task-specific, adaptive imaging pipelines that improve robustness, accuracy, and efficiency in real-world scenarios, such as autonomous navigation, surveillance, augmented reality, and robotics.
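As a concrete instance of one technique the survey covers, here is classical Wiener deconvolution for deblurring in the frequency domain; the constant k stands in for the noise-to-signal power ratio and is a tuning assumption.

```python
import numpy as np

def wiener_deblur(blurred, psf, k=0.01):
    """Wiener deconvolution: X(f) = conj(H) * G / (|H|^2 + k), where H is
    the FFT of the blur kernel (PSF) and G the FFT of the blurred image."""
    H = np.fft.fft2(psf, s=blurred.shape)     # zero-pad the PSF to image size
    G = np.fft.fft2(blurred)
    X = np.conj(H) * G / (np.abs(H) ** 2 + k)
    return np.real(np.fft.ifft2(X))

# e.g. undo an approximate 9x9 box blur:
# restored = wiener_deblur(blurred_image, np.ones((9, 9)) / 81.0)
```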
Fusion Models
Foshan University
Abstract
Different modalities of medical images provide unique physiological and anatomical information for diseases. Multi-modal medical image fusion integrates useful information from complementary medical images of different modalities, producing a fused image that comprehensively and objectively reflects lesion characteristics to assist doctors in clinical diagnosis. However, existing fusion methods can only handle a fixed number of modality inputs, such as accepting only two-modal or tri-modal inputs, and cannot directly process varying input quantities, which hinders their application in clinical settings. To tackle this issue, we introduce FlexiD-Fuse, a diffusion-based image fusion network designed to accommodate flexible quantities of input modalities. It can process two-modal and tri-modal medical image fusion end-to-end with the same weights. FlexiD-Fuse transforms the diffusion fusion problem, which supports only fixed-condition inputs, into a maximum likelihood estimation problem based on the diffusion process and hierarchical Bayesian modeling. By incorporating the Expectation-Maximization algorithm into the diffusion sampling iterations, FlexiD-Fuse can generate high-quality fused images with cross-modal information from the source images, independently of the number of inputs. We compared FlexiD-Fuse with the latest two-modal and tri-modal medical image fusion methods on the Harvard datasets, evaluating with nine popular metrics. The experimental results show that our method achieves the best performance in medical image fusion with varying inputs. We also conducted extensive extension experiments on infrared-visible, multi-exposure, and multi-focus image fusion tasks with arbitrary numbers of inputs, comparing against the respective SOTA methods; the results consistently demonstrate the effectiveness and superiority of our method.
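The EM-inside-sampling idea can be gestured at with a small sketch: at each denoising step, per-modality noise predictions are fused by an EM loop that re-weights modalities by how well they agree with the current fused estimate, so any number of sources can be handled. This is our simplified reading with arbitrary details, not FlexiD-Fuse's actual derivation.

```python
import torch

def em_fuse_eps(eps_per_modality, n_em=3, reg=1e-8):
    """Fuse an arbitrary number of per-modality noise predictions.

    E-step: score each modality by its squared residual to the fused mean.
    M-step: recompute the fused mean as a precision-weighted average.
    """
    eps = torch.stack(eps_per_modality)        # (M, C, H, W), M is flexible
    mu = eps.mean(dim=0)                       # initial fused estimate
    for _ in range(n_em):
        resid = ((eps - mu) ** 2).flatten(1).mean(dim=1)   # (M,)
        prec = 1.0 / (resid + reg)             # agreeing modalities count more
        w = prec / prec.sum()
        mu = (w.view(-1, 1, 1, 1) * eps).sum(dim=0)
    return mu                                  # feed into the usual DDPM update

# Inside a sampling loop (sketch), with any number of source images:
# eps_hat = em_fuse_eps([denoiser(x_t, t, cond=s) for s in sources])
```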
Technion - Israel Institute of Technology
Abstract
In this work, we demonstrate that a simple two-layer neural network with standard activation functions can learn an arbitrary word operation in any finite group, provided sufficient width is available and exhibits grokking while doing so. To explain the mechanism by which this is achieved, we reframe the problem as that of learning a particular $3$-tensor, which we show is typically of low rank. A key insight is that low-rank implementations of this tensor can be obtained by decomposing it along triplets of basic self-conjugate representations of the group and leveraging the fusion structure to rule out many components. Focusing on a phenomenologically similar but more tractable surrogate model, we show that the network is able to find such low-rank implementations (or approximations thereof), thereby using limited width to approximate the word-tensor in a generalizable way. In the case of the simple multiplication word, we further elucidate the form of these low-rank implementations, showing that the network effectively implements efficient matrix multiplication in the sense of Strassen. Our work also sheds light on the mechanism by which a network reaches such a solution under gradient descent.
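The first sentence's setup is easy to reproduce in miniature: below, a two-layer MLP learns the simple multiplication word w(a, b) = ab in the cyclic group Z/59 (written additively) from one-hot pairs, with the strong weight decay under which grokking is typically observed. Width, optimizer, and group choice are our assumptions, not the paper's exact experiments.

```python
import torch
import torch.nn as nn

p = 59                                            # the cyclic group Z/p
a = torch.arange(p).repeat_interleave(p)          # all ordered pairs (a, b)
b = torch.arange(p).repeat(p)
x = torch.cat([torch.eye(p)[a], torch.eye(p)[b]], dim=1)  # one-hot inputs
y = (a + b) % p                                   # the group operation

perm = torch.randperm(p * p)
train, test = perm[: p * p // 2], perm[p * p // 2:]

# Two layers, standard activation; with heavy weight decay, test accuracy
# typically jumps long after the training loss has already vanished.
model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(20001):
    opt.zero_grad()
    loss = loss_fn(model(x[train]), y[train])
    loss.backward()
    opt.step()
    if step % 2000 == 0:
        with torch.no_grad():
            acc = (model(x[test]).argmax(dim=1) == y[test]).float().mean()
        print(f"step {step:5d}  train loss {loss.item():.4f}  test acc {acc:.3f}")
```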