Hi!

Your personalized paper recommendations for 8–12 December 2025.
🎯 Top Personalized Recommendations
University of Patras
AI Summary
  • Data augmentation: The process of artificially increasing the size of a dataset by applying transformations such as rotation, scaling, and flipping to the existing data. [3]
  • Vision Transformers (ViTs) have been gaining popularity in recent years due to their ability to outperform traditional Convolutional Neural Networks (CNNs) on various computer vision tasks. [2]
Abstract
Convolutional Neural Networks (CNNs) for computer vision sometimes struggle with understanding images in a global context, as they mainly focus on local patterns. On the other hand, Vision Transformers (ViTs), inspired by models originally created for language processing, use self-attention mechanisms, which allow them to understand relationships across the entire image. In this paper, we compare different types of ViTs (pure, hierarchical, and hybrid) against traditional CNN models across various tasks, including object recognition, detection, and medical image classification. We conduct thorough tests on standard datasets like ImageNet for image classification and COCO for object detection. Additionally, we apply these models to medical imaging using the ChestX-ray14 dataset. We find that hybrid and hierarchical transformers, especially Swin and CvT, offer a strong balance between accuracy and computational resources. Furthermore, by experimenting with data augmentation techniques on medical images, we discover significant performance improvements, particularly with the Swin Transformer model. Overall, our results indicate that Vision Transformers are competitive and, in many cases, outperform traditional CNNs, especially in scenarios requiring the understanding of global visual contexts like medical imaging.
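If you want to poke at the comparison yourself, here is a minimal sketch of the kind of pipeline the abstract describes, using torchvision for the rotation/scaling/flipping augmentations and timm to instantiate a CNN baseline and a Swin Transformer; the model names, class count, and hyperparameters are placeholders, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' code): a CNN baseline vs. a hierarchical ViT
# with the kinds of augmentations mentioned above (rotation, scaling, flipping).
import torch
import timm
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random scaling + crop
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# 14 output classes as a stand-in for a ChestX-ray14-style label set.
cnn = timm.create_model("resnet50", pretrained=True, num_classes=14)
vit = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True, num_classes=14)

x = torch.randn(2, 3, 224, 224)    # stand-in batch of preprocessed images
print(cnn(x).shape, vit(x).shape)  # both: torch.Size([2, 14])
```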
Why we think this paper is great for you:
This paper directly explores Vision Transformers, a key architecture that aligns with your interest in alternative approaches to image recognition, particularly those leveraging self-attention mechanisms.
Kyoto University
Abstract
Based on the eigenvector continuation, we construct an emulator for coupled-channels calculations for heavy-ion fusion reactions at energies around the Coulomb barrier. We apply this to the $^{16}$O+$^{144,154}$Sm, $^{186}$W reactions and examine whether the emulator can be used to extract the deformation parameters of the target nuclei. We show that the emulator not only accelerates the calculations but also has an ability to accurately extract the nuclear shapes. This indicates that the emulator provides a powerful tool to systematically explore intrinsic shapes of atomic nuclei, enhancing our understanding of the fundamental properties of nuclear systems.
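Eigenvector continuation is compact enough to illustrate in a few lines: solve the full problem at a handful of training parameter values, collect the eigenvectors as a reduced basis, and then solve a small generalized eigenvalue problem in that basis at new parameter values. The toy sketch below uses a random two-term Hamiltonian, not the paper's coupled-channels equations.

```python
# Toy eigenvector-continuation emulator (illustrative only; not the paper's
# coupled-channels implementation). Model: H(theta) = H0 + theta * H1.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 200
H0 = rng.normal(size=(n, n)); H0 = (H0 + H0.T) / 2
H1 = rng.normal(size=(n, n)); H1 = (H1 + H1.T) / 2

def H(theta):
    return H0 + theta * H1

# Exact ground states at a few training values of theta span the reduced basis.
train_thetas = [0.0, 0.5, 1.0]
basis = np.column_stack([eigh(H(t))[1][:, 0] for t in train_thetas])

def emulated_ground_energy(theta):
    # Project H(theta) into the reduced basis and solve the small generalized
    # eigenvalue problem  H_red c = E N c, where N is the basis overlap matrix.
    H_red = basis.T @ H(theta) @ basis
    N = basis.T @ basis
    return eigh(H_red, N)[0][0]

theta = 0.73
print("emulator:", emulated_ground_energy(theta), "  exact:", eigh(H(theta))[0][0])
```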
Why we think this paper is great for you:
The research builds an emulator for heavy-ion fusion calculations; the connection to your image-processing interests is loose, but the use of a fast surrogate model to extract physical parameters from expensive calculations may still be of interest.
Marathon Fusion
AI Summary
  • Higher neutron wall loading on the inboard side may require more inboard shielding, exacerbating the problem. [2]
  • The transmutation fraction, η_pro, increases as Raκ decreases in a tokamak, making compact, low-aspect-ratio, and low-elongation tokamaks ideal for maximizing feedstock neutron transmutation. [1]
  • High-temperature superconductors enable machines with high magnetic fields and low magnet cooling power. [0]
Abstract
Fusion systems producing isotopes via neutron-driven transmutation can achieve economic viability well before reaching energy breakeven. Incorporating carefully selected feedstock materials within the blanket allows fusion systems to generate both electrical power and high-value isotopes, expanding the space of viable concepts, significantly enhancing the economic value of fusion energy, and supporting an accelerated path to adoption. We calculate the value of this co-generation and derive a new economic breakeven condition based on net present value. At lower plasma gain, $Q_{\mathrm{plas}}\lesssim1-3$, high-value transmutation, such as medical radioisotopes, enables pure transmuter fusion systems operating at only a few megawatts of fusion power: for example, a 3 megawatt system transmuting ${}^{102}\mathrm{Ru}\rightarrow{}^{99}\mathrm{Mo}$ could fulfill global ${}^{99}\mathrm{Mo}$ demand with $Q_{\mathrm{plas}}\ll1$. At higher gain $Q_{\mathrm{plas}}\gtrsim3$, it becomes viable to generate electricity in addition to isotopes. For example, co-production of electricity and gold, transmuted from mercury in a fusion blanket, can reduce the required plasma gain for viability from $Q_{\mathrm{plas}}\sim10-100$ to $Q_{\mathrm{plas}}\sim3-5$. We further highlight techniques to enhance transmutation including magnetic mirrors, asymmetric neutron wall loading, and neutron multiplication. Fusion neutron-driven transmutation therefore offers a revenue-positive pathway for deploying fusion energy at terawatt-scale, starting from smaller megawatt-scale machines for radioisotope production and then scaling up to co-producing electricity and gold in larger fusion power plants.
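The economic argument can be sketched numerically: discount the yearly revenue from electricity and isotope sales against capital and operating costs, and ask when the net present value turns positive. All figures in the sketch below are made-up placeholders, not numbers from the paper.

```python
# Toy net-present-value breakeven check for a co-generating fusion system.
# Every number here is an illustrative placeholder, not a value from the paper.
def npv(cash_flows, rate):
    """Discounted sum of yearly cash flows; year 0 is undiscounted."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))

capex = 2.0e9                 # up-front construction cost [$]
electricity_revenue = 150e6   # yearly electricity sales [$]
isotope_revenue = 250e6       # yearly sales of transmuted isotopes [$]
opex = 200e6                  # yearly operating cost [$]
lifetime_years = 30
discount_rate = 0.07

flows = [-capex] + [electricity_revenue + isotope_revenue - opex] * lifetime_years
value = npv(flows, discount_rate)
print(f"NPV = {value / 1e9:.2f} B$ ->", "viable" if value > 0 else "not viable")
```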
Why we think this paper is great for you:
This paper analyzes fusion systems that co-produce electricity and high-value isotopes; the connection to your visual-computing interests is loose, so treat this as an exploratory recommendation.
Shanghai Jiao Tong University
AI Summary
  • The paper introduces a novel point-representation method, Grid Tokens, designed to improve localization performance in vision-language models (VLMs). [2]
  • Grid Tokens provide a stable foundation for policy learning, allowing faster convergence and higher rewards. [3]
  • The method's advantages include its ability to capture complex mask semantics, its flexibility at inference time, and its simplicity compared to task-specific decoders. [3]
Abstract
Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization of input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.
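The grid-plus-offset idea can be illustrated independently of the model: quantize a continuous point to a coarse grid cell (the spatial anchor token) and encode the residual position inside that cell with discrete offset tokens. The sketch below assumes a 32x32 grid and 16 offset bins; the paper's actual GETok vocabulary and refinement procedure may differ.

```python
# Minimal sketch of grid-anchor + offset encoding for a 2D point (not the
# paper's exact GETok design; grid size and offset resolution are assumptions).
GRID = 32          # the image plane is split into GRID x GRID spatial anchors
OFFSET_BINS = 16   # the residual inside a cell is quantized into OFFSET_BINS levels

def encode_point(x, y):
    """x, y are normalized coordinates in [0, 1). Returns three token ids."""
    gx, gy = int(x * GRID), int(y * GRID)
    grid_token = gy * GRID + gx                    # which cell (spatial anchor)
    off_x = int((x * GRID - gx) * OFFSET_BINS)     # residual inside the cell,
    off_y = int((y * GRID - gy) * OFFSET_BINS)     # quantized into offset tokens
    return grid_token, off_x, off_y

def decode_point(grid_token, off_x, off_y):
    gx, gy = grid_token % GRID, grid_token // GRID
    x = (gx + (off_x + 0.5) / OFFSET_BINS) / GRID
    y = (gy + (off_y + 0.5) / OFFSET_BINS) / GRID
    return x, y

tokens = encode_point(0.62, 0.31)
print(tokens, decode_point(*tokens))  # decoding is exact to within half an offset bin
```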
Why we think this paper is great for you:
The paper's focus on grounding objects within images aligns with your interest in multimodal models and how they process visual information.
Sun Yat-sen University
Abstract
The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
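One way to picture a benchmark of this kind: each item pairs an image with several candidate event chains, exactly one of which is both visually consistent and logically coherent, and models are scored by selection accuracy. The structure below is a guessed illustration, not MM-CoT's actual schema.

```python
# Hypothetical representation of one MM-CoT-style item plus a selection-accuracy
# metric; field names and example content are illustrative, not the benchmark's.
from dataclasses import dataclass

@dataclass
class CoTItem:
    image_id: str
    chains: list[str]          # candidate event chains
    correct: int               # index of the sole valid chain
    violation: dict[int, str]  # distractor index -> "visual" or "logical"

item = CoTItem(
    image_id="kitchen_0042",
    chains=[
        "The pot boils over, so the person lowers the heat.",        # valid
        "The pot is empty, so the person lowers the heat.",          # contradicts the image
        "The person lowers the heat, so the pot later boils over.",  # breaks causality
    ],
    correct=0,
    violation={1: "visual", 2: "logical"},
)

def selection_accuracy(predictions, items):
    """predictions[i] is the chain index a model picked for items[i]."""
    return sum(p == it.correct for p, it in zip(predictions, items)) / len(items)

print(selection_accuracy([0], [item]))  # 1.0
```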
Why we think this paper is great for you:
This paper investigates Chain-of-Thought reasoning in multimodal models, a crucial aspect of visual understanding and problem-solving.
Fordham University
Abstract
Visual place recognition (VPR) is an important component technology for camera-based mapping and navigation applications. This is a challenging problem because images of the same place may appear quite different for reasons including seasonal changes, weather, illumination, structural changes to the environment, as well as transient pedestrian or vehicle traffic. Papers focusing on generating image descriptors for VPR report their results using metrics such as recall@K and ROC curves. However, for a robot implementation, determining which matches are sufficiently good is often reduced to a manually set threshold, and it is difficult to manually select a threshold that will work for a variety of visual scenarios. This paper addresses the problem of automatically selecting a threshold for VPR by looking at the 'negative' Gaussian mixture statistics for a place: image statistics indicating 'not this place'. We show that this approach can be used to select thresholds that work well for a variety of image databases and image descriptors.
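The thresholding idea can be sketched as follows, with the caveat that the paper's precise criterion may differ from the quantile rule assumed here: fit a Gaussian mixture to descriptor distances of known non-matching ('not this place') pairs, then accept a query match only if its distance would be very unlikely under that negative model.

```python
# Illustrative threshold selection from "negative" (non-matching) statistics.
# The quantile rule and the stand-in data are assumptions, not the paper's method.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Descriptor distances observed for pairs of images of *different* places.
negative_distances = np.concatenate([
    rng.normal(0.9, 0.08, 5000),
    rng.normal(1.2, 0.10, 5000),
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2).fit(negative_distances)

# Accept a match only if almost no non-matching pair scores this low: take a
# small quantile of the fitted negative-distance distribution as the threshold.
samples, _ = gmm.sample(100_000)
threshold = float(np.quantile(samples, 0.01))

query_distance = 0.55
print(round(threshold, 3), "match" if query_distance < threshold else "reject")
```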
Why we think this paper is great for you:
The research centers on Visual Place Recognition, a field that heavily relies on image processing and pattern recognition techniques.
ETH Zurich
Abstract
Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling, unsuitable for real-time applications. These limitations significantly restrict general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference compared to existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at https://github.com/lpiccinelli-eth/velodepth
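The propagation idea, warping the previous frame's depth features along optical flow and correcting them with a learned residual, can be sketched generically in PyTorch; the snippet below is not the authors' Propagation Module, just the mechanism the abstract names.

```python
# Generic sketch of flow-based feature propagation with a learned residual
# correction (illustrative only; not VeloDepth's actual Propagation Module).
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(features, flow):
    """Backward-warp features (B,C,H,W) with a pixel-space flow (B,2,H,W), channel order (x, y)."""
    _, _, h, w = features.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(features.device)  # (2,H,W)
    coords = grid.unsqueeze(0) + flow                                # sampling positions
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                    # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)          # (B,H,W,2)
    return F.grid_sample(features, sample_grid, align_corners=True)

class PropagationSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # predicts a residual correction from the warped past and current features
        self.residual = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, prev_feat, curr_feat, flow):
        warped = warp(prev_feat, flow)
        return warped + self.residual(torch.cat([warped, curr_feat], dim=1))

prop = PropagationSketch(channels=64)
out = prop(torch.randn(1, 64, 48, 64), torch.randn(1, 64, 48, 64), torch.zeros(1, 2, 48, 64))
print(out.shape)  # torch.Size([1, 64, 48, 64])
```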
Why we think this paper is great for you:
This paper deals with depth estimation in videos, a core component of visual perception and a key area within image processing.
Image Processing
University of Washington
Abstract
Optical neural networks (ONNs) are gaining increasing attention as a means to accelerate machine learning tasks. In particular, static meta-optical encoders designed for task-specific pre-processing have demonstrated orders of magnitude smaller energy consumption than their purely digital counterparts, albeit at the cost of a slight degradation in classification accuracy. However, a lack of generalizability poses serious challenges for wide deployment of static meta-optical front-ends. Here, we investigate the utility of a metalens for generalized computer vision. Specifically, we show that a metalens optimized for full-color imaging can achieve image classification accuracy comparable to high-end, sensor-limited optics and consistently outperforms a hyperboloid metalens across a wide range of sensor pixel sizes. We further design an end-to-end single aperture metasurface for ImageNet classification and find that the optimized metasurface tends to balance the modulation transfer function (MTF) for each wavelength. Together, these findings highlight that the preservation of spatial frequency-domain information is an essential interpretable factor underlying ONN performance. Our work provides both an interpretable understanding of task-driven optical optimization and practical guidance for designing high-performance ONNs and meta-optical encoders for generalizable computer vision.
AI Summary
  • The learned phase coefficients exhibit substantial fluctuations even after convergence. [3]
  • End-to-end optimization may settle into local minima or flat regions of the loss landscape, causing instability. [3]
  • The end-to-end designed metalens preserves higher color fidelity and smoother global intensity gradients across the field, with reduced inter-channel crosstalk and more accurate saturation in chromatic regions. [2]
  • RCWA: Rigorous Coupled-Wave Analysis, a numerical method for simulating the behavior of light in periodic structures. [0]
  • The MTF of the end-to-end optimized metalens is more balanced and uniform in-band across the RGB channels than that of the hyperboloid lens. [0]
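The quantity these summary points keep returning to, the modulation transfer function, is the normalized magnitude of the Fourier transform of the point spread function, so 'balanced across RGB' can be checked numerically. The toy computation below uses Gaussian stand-in PSFs rather than the paper's full-wave (RCWA) simulations.

```python
# Toy per-channel MTF computation from stand-in point spread functions
# (illustrative; the paper's PSFs come from full electromagnetic simulation).
import numpy as np

def mtf(psf):
    """Modulation transfer function: normalized magnitude of the PSF's 2D FFT."""
    m = np.abs(np.fft.fftshift(np.fft.fft2(psf)))
    return m / m.max()

def gaussian_psf(size, sigma):
    x = np.arange(size) - size // 2
    xx, yy = np.meshgrid(x, x)
    psf = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return psf / psf.sum()

# Stand-in PSFs for the R, G, B channels; a "balanced" design keeps their MTFs similar.
for channel, sigma in [("R", 2.0), ("G", 2.2), ("B", 2.4)]:
    m = mtf(gaussian_psf(64, sigma))
    print(channel, "mean in-band MTF:", round(float(m[32, 28:37].mean()), 3))
```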

Interests not found

We did not find any papers that match the interests below. Try other terms, and consider whether the content exists on arxiv.org.
  • convolution
You can edit or add more interests any time.