Hi!

Your personalized paper recommendations for 8–12 December 2025.
🎯 Top Personalized Recommendations
University of Patras
AI Summary
  • Data augmentation: The process of artificially increasing the size of a dataset by applying transformations such as rotation, scaling, and flipping to the existing data. [3]
  • Vision Transformers (ViTs) have been gaining popularity in recent years due to their ability to outperform traditional Convolutional Neural Networks (CNNs) on various computer vision tasks. [2]
Abstract
Convolutional Neural Networks (CNNs) for computer vision sometimes struggle with understanding images in a global context, as they mainly focus on local patterns. On the other hand, Vision Transformers (ViTs), inspired by models originally created for language processing, use self-attention mechanisms, which allow them to understand relationships across the entire image. In this paper, we compare different types of ViTs (pure, hierarchical, and hybrid) against traditional CNN models across various tasks, including object recognition, detection, and medical image classification. We conduct thorough tests on standard datasets like ImageNet for image classification and COCO for object detection. Additionally, we apply these models to medical imaging using the ChestX-ray14 dataset. We find that hybrid and hierarchical transformers, especially Swin and CvT, offer a strong balance between accuracy and computational resources. Furthermore, by experimenting with data augmentation techniques on medical images, we discover significant performance improvements, particularly with the Swin Transformer model. Overall, our results indicate that Vision Transformers are competitive and, in many cases, outperform traditional CNNs, especially in scenarios requiring the understanding of global visual contexts like medical imaging.
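If you want to poke at the comparison yourself, here is a minimal sketch of the kind of pipeline the abstract describes, using torchvision for the rotation/scaling/flipping augmentations and timm to instantiate a CNN baseline and a Swin Transformer; the model names, class count, and hyperparameters are placeholders, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' code): a CNN baseline vs. a hierarchical ViT
# with the kinds of augmentations mentioned above (rotation, scaling, flipping).
import torch
import timm
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random scaling + crop
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# 14 output classes as a stand-in for a ChestX-ray14-style label set.
cnn = timm.create_model("resnet50", pretrained=True, num_classes=14)
vit = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True, num_classes=14)

x = torch.randn(2, 3, 224, 224)    # stand-in batch of preprocessed images
print(cnn(x).shape, vit(x).shape)  # both: torch.Size([2, 14])
```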
Why we think this paper is great for you:
This paper directly explores Vision Transformers, a key architecture that aligns with your interest in alternative approaches to image recognition, particularly those leveraging self-attention mechanisms.
Kyoto University
Abstract
Based on the eigenvector continuation, we construct an emulator for coupled-channels calculations for heavy-ion fusion reactions at energies around the Coulomb barrier. We apply this to the $^{16}$O+$^{144,154}$Sm, $^{186}$W reactions and examine whether the emulator can be used to extract the deformation parameters of the target nuclei. We show that the emulator not only accelerates the calculations but also has an ability to accurately extract the nuclear shapes. This indicates that the emulator provides a powerful tool to systematically explore intrinsic shapes of atomic nuclei, enhancing our understanding of the fundamental properties of nuclear systems.
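Eigenvector continuation is compact enough to illustrate in a few lines: solve the full problem at a handful of training parameter values, collect the eigenvectors as a reduced basis, and then solve a small generalized eigenvalue problem in that basis at new parameter values. The toy sketch below uses a random two-term Hamiltonian, not the paper's coupled-channels equations.

```python
# Toy eigenvector-continuation emulator (illustrative only; not the paper's
# coupled-channels implementation). Model: H(theta) = H0 + theta * H1.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 200
H0 = rng.normal(size=(n, n)); H0 = (H0 + H0.T) / 2
H1 = rng.normal(size=(n, n)); H1 = (H1 + H1.T) / 2

def H(theta):
    return H0 + theta * H1

# Exact ground states at a few training values of theta span the reduced basis.
train_thetas = [0.0, 0.5, 1.0]
basis = np.column_stack([eigh(H(t))[1][:, 0] for t in train_thetas])

def emulated_ground_energy(theta):
    # Project H(theta) into the reduced basis and solve the small generalized
    # eigenvalue problem  H_red c = E N c, where N is the basis overlap matrix.
    H_red = basis.T @ H(theta) @ basis
    N = basis.T @ basis
    return eigh(H_red, N)[0][0]

theta = 0.73
print("emulator:", emulated_ground_energy(theta), "  exact:", eigh(H(theta))[0][0])
```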
Why we think this paper is great for you:
The research builds an emulator for heavy-ion fusion calculations; the connection to your image-processing interests is loose, but the use of a fast surrogate model to extract physical parameters from expensive calculations may still be of interest.
Marathon Fusion
AI Summary
  • Higher neutron wall loading on the inboard side may require more inboard shielding, exacerbating the problem. [2]
  • The transmutation fraction, η_pro, increases as Raκ decreases in a tokamak, making compact, low-aspect-ratio, and low-elongation tokamaks ideal for maximizing feedstock neutron transmutation. [1]
  • High-temperature superconductors enable machines with high magnetic fields and low magnet cooling power. [0]
Abstract
Fusion systems producing isotopes via neutron-driven transmutation can achieve economic viability well before reaching energy breakeven. Incorporating carefully selected feedstock materials within the blanket allows fusion systems to generate both electrical power and high-value isotopes, expanding the space of viable concepts, significantly enhancing the economic value of fusion energy, and supporting an accelerated path to adoption. We calculate the value of this co-generation and derive a new economic breakeven condition based on net present value. At lower plasma gain, $Q_{\mathrm{plas}}\lesssim1-3$, high-value transmutation, such as medical radioisotopes, enables pure transmuter fusion systems operating at only a few megawatts of fusion power: for example, a 3 megawatt system transmuting ${}^{102}\mathrm{Ru}\rightarrow{}^{99}\mathrm{Mo}$ could fulfill global ${}^{99}\mathrm{Mo}$ demand with $Q_{\mathrm{plas}}\ll1$. At higher gain $Q_{\mathrm{plas}}\gtrsim3$, it becomes viable to generate electricity in addition to isotopes. For example, co-production of electricity and gold, transmuted from mercury in a fusion blanket, can reduce the required plasma gain for viability from $Q_{\mathrm{plas}}\sim10-100$ to $Q_{\mathrm{plas}}\sim3-5$. We further highlight techniques to enhance transmutation including magnetic mirrors, asymmetric neutron wall loading, and neutron multiplication. Fusion neutron-driven transmutation therefore offers a revenue-positive pathway for deploying fusion energy at terawatt-scale, starting from smaller megawatt-scale machines for radioisotope production and then scaling up to co-producing electricity and gold in larger fusion power plants.
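The economic argument can be sketched numerically: discount the yearly revenue from electricity and isotope sales against capital and operating costs, and ask when the net present value turns positive. All figures in the sketch below are made-up placeholders, not numbers from the paper.

```python
# Toy net-present-value breakeven check for a co-generating fusion system.
# Every number here is an illustrative placeholder, not a value from the paper.
def npv(cash_flows, rate):
    """Discounted sum of yearly cash flows; year 0 is undiscounted."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))

capex = 2.0e9                 # up-front construction cost [$]
electricity_revenue = 150e6   # yearly electricity sales [$]
isotope_revenue = 250e6       # yearly sales of transmuted isotopes [$]
opex = 200e6                  # yearly operating cost [$]
lifetime_years = 30
discount_rate = 0.07

flows = [-capex] + [electricity_revenue + isotope_revenue - opex] * lifetime_years
value = npv(flows, discount_rate)
print(f"NPV = {value / 1e9:.2f} B$ ->", "viable" if value > 0 else "not viable")
```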
Why we think this paper is great for you:
This paper analyzes fusion systems that co-produce electricity and high-value isotopes; the connection to your visual-computing interests is loose, so treat this as an exploratory recommendation.
Shanghai Jiao Tong University
AI Summary
  • The paper introduces a novel point-representation method, Grid Tokens, designed to improve localization performance in vision-language models (VLMs). [2]
  • Grid Tokens provide a stable foundation for policy learning, allowing faster convergence and higher rewards. [3]
  • The method's advantages include its ability to capture complex mask semantics, its flexibility at inference time, and its simplicity compared to task-specific decoders. [3]
Abstract
Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization of input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.
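The grid-plus-offset idea can be illustrated independently of the model: quantize a continuous point to a coarse grid cell (the spatial anchor token) and encode the residual position inside that cell with discrete offset tokens. The sketch below assumes a 32x32 grid and 16 offset bins; the paper's actual GETok vocabulary and refinement procedure may differ.

```python
# Minimal sketch of grid-anchor + offset encoding for a 2D point (not the
# paper's exact GETok design; grid size and offset resolution are assumptions).
GRID = 32          # the image plane is split into GRID x GRID spatial anchors
OFFSET_BINS = 16   # the residual inside a cell is quantized into OFFSET_BINS levels

def encode_point(x, y):
    """x, y are normalized coordinates in [0, 1). Returns three token ids."""
    gx, gy = int(x * GRID), int(y * GRID)
    grid_token = gy * GRID + gx                    # which cell (spatial anchor)
    off_x = int((x * GRID - gx) * OFFSET_BINS)     # residual inside the cell,
    off_y = int((y * GRID - gy) * OFFSET_BINS)     # quantized into offset tokens
    return grid_token, off_x, off_y

def decode_point(grid_token, off_x, off_y):
    gx, gy = grid_token % GRID, grid_token // GRID
    x = (gx + (off_x + 0.5) / OFFSET_BINS) / GRID
    y = (gy + (off_y + 0.5) / OFFSET_BINS) / GRID
    return x, y

tokens = encode_point(0.62, 0.31)
print(tokens, decode_point(*tokens))  # decoding is exact to within half an offset bin
```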
Why we think this paper is great for you:
The paper's focus on grounding objects within images aligns with your interest in multimodal models and how they process visual information.
Sun Yat-sen University
Abstract
The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
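One way to picture a benchmark of this kind: each item pairs an image with several candidate event chains, exactly one of which is both visually consistent and logically coherent, and models are scored by selection accuracy. The structure below is a guessed illustration, not MM-CoT's actual schema.

```python
# Hypothetical representation of one MM-CoT-style item plus a selection-accuracy
# metric; field names and example content are illustrative, not the benchmark's.
from dataclasses import dataclass

@dataclass
class CoTItem:
    image_id: str
    chains: list[str]          # candidate event chains
    correct: int               # index of the sole valid chain
    violation: dict[int, str]  # distractor index -> "visual" or "logical"

item = CoTItem(
    image_id="kitchen_0042",
    chains=[
        "The pot boils over, so the person lowers the heat.",        # valid
        "The pot is empty, so the person lowers the heat.",          # contradicts the image
        "The person lowers the heat, so the pot later boils over.",  # breaks causality
    ],
    correct=0,
    violation={1: "visual", 2: "logical"},
)

def selection_accuracy(predictions, items):
    """predictions[i] is the chain index a model picked for items[i]."""
    return sum(p == it.correct for p, it in zip(predictions, items)) / len(items)

print(selection_accuracy([0], [item]))  # 1.0
```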
Why we think this paper is great for you:
This paper investigates Chain-of-Thought reasoning in multimodal models, a crucial aspect of visual understanding and problem-solving.
Fordham University
Abstract
Visual place recognition (VPR) is an important component technology for camera-based mapping and navigation applications. This is a challenging problem because images of the same place may appear quite different for reasons including seasonal changes, weather, illumination, structural changes to the environment, as well as transient pedestrian or vehicle traffic. Papers focusing on generating image descriptors for VPR report their results using metrics such as recall@K and ROC curves. However, for a robot implementation, determining which matches are sufficiently good is often reduced to a manually set threshold, and it is difficult to manually select a threshold that will work for a variety of visual scenarios. This paper addresses the problem of automatically selecting a threshold for VPR by looking at the 'negative' Gaussian mixture statistics for a place: image statistics indicating 'not this place'. We show that this approach can be used to select thresholds that work well for a variety of image databases and image descriptors.
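The thresholding idea can be sketched as follows, with the caveat that the paper's precise criterion may differ from the quantile rule assumed here: fit a Gaussian mixture to descriptor distances of known non-matching ('not this place') pairs, then accept a query match only if its distance would be very unlikely under that negative model.

```python
# Illustrative threshold selection from "negative" (non-matching) statistics.
# The quantile rule and the stand-in data are assumptions, not the paper's method.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Descriptor distances observed for pairs of images of *different* places.
negative_distances = np.concatenate([
    rng.normal(0.9, 0.08, 5000),
    rng.normal(1.2, 0.10, 5000),
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2).fit(negative_distances)

# Accept a match only if almost no non-matching pair scores this low: take a
# small quantile of the fitted negative-distance distribution as the threshold.
samples, _ = gmm.sample(100_000)
threshold = float(np.quantile(samples, 0.01))

query_distance = 0.55
print(round(threshold, 3), "match" if query_distance < threshold else "reject")
```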
Why we think this paper is great for you:
The research centers on Visual Place Recognition, a field that heavily relies on image processing and pattern recognition techniques.
ETH Zurich
Abstract
Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling, unsuitable for real-time applications. These limitations significantly restrict general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference compared to existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at https://github.com/lpiccinelli-eth/velodepth
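The propagation idea, warping the previous frame's depth features along optical flow and correcting them with a learned residual, can be sketched generically in PyTorch; the snippet below is not the authors' Propagation Module, just the mechanism the abstract names.

```python
# Generic sketch of flow-based feature propagation with a learned residual
# correction (illustrative only; not VeloDepth's actual Propagation Module).
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(features, flow):
    """Backward-warp features (B,C,H,W) with a pixel-space flow (B,2,H,W), channel order (x, y)."""
    _, _, h, w = features.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(features.device)  # (2,H,W)
    coords = grid.unsqueeze(0) + flow                                # sampling positions
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                    # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)          # (B,H,W,2)
    return F.grid_sample(features, sample_grid, align_corners=True)

class PropagationSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # predicts a residual correction from the warped past and current features
        self.residual = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, prev_feat, curr_feat, flow):
        warped = warp(prev_feat, flow)
        return warped + self.residual(torch.cat([warped, curr_feat], dim=1))

prop = PropagationSketch(channels=64)
out = prop(torch.randn(1, 64, 48, 64), torch.randn(1, 64, 48, 64), torch.zeros(1, 2, 48, 64))
print(out.shape)  # torch.Size([1, 64, 48, 64])
```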
Why we think this paper is great for you:
This paper deals with depth estimation in videos, a core component of visual perception and a key area within image processing.
Image Processing
University of Washington
Abstract
Optical neural networks (ONNs) are gaining increasing attention as a means to accelerate machine learning tasks. In particular, static meta-optical encoders designed for task-specific pre-processing have demonstrated orders of magnitude smaller energy consumption than their purely digital counterparts, albeit at the cost of a slight degradation in classification accuracy. However, a lack of generalizability poses serious challenges for wide deployment of static meta-optical front-ends. Here, we investigate the utility of a metalens for generalized computer vision. Specifically, we show that a metalens optimized for full-color imaging can achieve image classification accuracy comparable to high-end, sensor-limited optics and consistently outperforms a hyperboloid metalens across a wide range of sensor pixel sizes. We further design an end-to-end single aperture metasurface for ImageNet classification and find that the optimized metasurface tends to balance the modulation transfer function (MTF) for each wavelength. Together, these findings highlight that the preservation of spatial frequency-domain information is an essential interpretable factor underlying ONN performance. Our work provides both an interpretable understanding of task-driven optical optimization and practical guidance for designing high-performance ONNs and meta-optical encoders for generalizable computer vision.
AI Summary
  • The learned phase coefficients exhibit substantial fluctuations even after convergence. [3]
  • End-to-end optimization may settle into local minima or flat regions of the loss landscape, causing instability. [3]
  • The end-to-end designed metalens preserves higher color fidelity and smoother global intensity gradients across the field, with reduced inter-channel crosstalk and more accurate saturation in chromatic regions. [2]
  • RCWA: Rigorous Coupled-Wave Analysis, a numerical method for simulating the behavior of light in periodic structures. [0]
  • The MTF of the end-to-end optimized metalens is more balanced and uniform in-band across the RGB channels than that of the hyperboloid lens. [0]
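The quantity these summary points keep returning to, the modulation transfer function, is the normalized magnitude of the Fourier transform of the point spread function, so 'balanced across RGB' can be checked numerically. The toy computation below uses Gaussian stand-in PSFs rather than the paper's full-wave (RCWA) simulations.

```python
# Toy per-channel MTF computation from stand-in point spread functions
# (illustrative; the paper's PSFs come from full electromagnetic simulation).
import numpy as np

def mtf(psf):
    """Modulation transfer function: normalized magnitude of the PSF's 2D FFT."""
    m = np.abs(np.fft.fftshift(np.fft.fft2(psf)))
    return m / m.max()

def gaussian_psf(size, sigma):
    x = np.arange(size) - size // 2
    xx, yy = np.meshgrid(x, x)
    psf = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return psf / psf.sum()

# Stand-in PSFs for the R, G, B channels; a "balanced" design keeps their MTFs similar.
for channel, sigma in [("R", 2.0), ("G", 2.2), ("B", 2.4)]:
    m = mtf(gaussian_psf(64, sigma))
    print(channel, "mean in-band MTF:", round(float(m[32, 28:37].mean()), 3))
```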

Interests not found

We did not find any papers that match the interests below. Try other terms, and consider whether the content exists on arxiv.org.
  • convolution
You can edit or add more interests any time.