Università degli Studi di
AI Insights - The choice of activation function significantly affects the performance of the network, with the tanh(x) function yielding better results than the sigmoid (logsig) function for the DTT configuration. [3]
- RMSD: Root Mean Square Deviation, a measure of the error between two sets of data. [3]
- PINN: Physics-Informed Neural Network, a type of neural network that incorporates physical laws and constraints into its architecture. [3]
- The paper presents a physics-informed neural network (PINN) approach for reconstructing the last closed flux surface (LCFS) in tokamaks, which is crucial for plasma confinement and stability. [2]
- The authors demonstrate that their approach can accurately reconstruct the LCFS on small circular machines like RFX-mod2 with an RMSD of less than 1 cm, and on larger devices like DTT with a higher number of magnetic measurements. [1]
- The proposed method uses a physics-based loss function to enforce the Grad-Shafranov equation in the vacuum region of the tokamak. [0]
- The proposed approach has the potential to become a valuable tool for plasma physics research and applications, enabling fast and accurate reconstruction of the LCFS in various tokamak configurations. [0]
- Grad-Shafranov equation: A partial differential equation describing the axisymmetric magnetohydrodynamic equilibrium, i.e., the poloidal magnetic flux, in a tokamak. [0]
Abstract
In this work, we propose a novel physics-informed neural network based algorithm for real-time plasma boundary reconstruction in tokamak devices. The approach is based on a single Extreme Learning Machine network used to solve the homogeneous Grad-Shafranov equation, which is required to identify the plasma boundary. This architecture enables the real-time training of the network parameters using the available magnetic sensor data and, consequently, the dynamic adaptation of the network output to the evolving plasma equilibrium. We demonstrate that the network performs accurate plasma boundary reconstruction for complex configurations, outperforming well-established methods, such as the algorithm used for decades at the Joint European Torus, the world's largest tokamak, until it ceased operation in 2023. Indeed, compared to the latter, the proposed solution better generalizes the poloidal flux function, without requiring algorithm retuning across different plasma equilibria. The proposed neural network reconstructor also demonstrates greater robustness with respect to noise on the magnetic measurements. Moreover, this method takes advantage of the generalization power of neural networks without the need for extensive, time-consuming training based on a huge amount of experimental data, making its implementation on existing devices straightforward.
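Since the abstract does not spell out the solver, the following is a minimal numpy sketch of the general recipe it describes: an Extreme Learning Machine with a fixed random hidden layer, whose output weights come from a single least-squares solve that fits the magnetic measurements while penalizing the homogeneous Grad-Shafranov residual at vacuum collocation points. The sensor geometry, the stand-in measurements, and the physics weight `lam` are invented for illustration; this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Extreme Learning Machine: random, fixed hidden layer; only the linear
# output weights w are fit, so "training" is a single least-squares solve.
N_HIDDEN = 200
A = rng.normal(size=(N_HIDDEN, 2))  # fixed random input weights for (R, Z)
b = rng.normal(size=N_HIDDEN)       # fixed random biases

def features(R, Z):
    """Hidden activations and the terms needed for the GS operator."""
    u = A[:, 0] * R[:, None] + A[:, 1] * Z[:, None] + b  # (n_pts, N_HIDDEN)
    t = np.tanh(u)
    dt = 1.0 - t**2            # d tanh / du
    ddt = -2.0 * t * dt        # d^2 tanh / du^2
    phi_R = dt * A[:, 0]
    phi_RR = ddt * A[:, 0] ** 2
    phi_ZZ = ddt * A[:, 1] ** 2
    # Homogeneous Grad-Shafranov operator applied to each basis function:
    # Delta* phi = phi_RR - (1/R) phi_R + phi_ZZ, required to vanish in vacuum.
    return t, phi_RR - phi_R / R[:, None] + phi_ZZ

# Placeholder magnetic sensors (positions and "measured" flux values).
R_s = rng.uniform(1.5, 2.5, 40)
Z_s = rng.uniform(-1.0, 1.0, 40)
psi_meas = np.exp(-((R_s - 2.0) ** 2 + Z_s**2))  # stand-in measurements

# Collocation points in the vacuum region where the physics loss is enforced.
R_c = rng.uniform(1.2, 2.8, 400)
Z_c = rng.uniform(-1.3, 1.3, 400)

phi_s, _ = features(R_s, Z_s)
_, gs_c = features(R_c, Z_c)

lam = 1e-2  # assumed weight of the physics residual
M = np.vstack([phi_s, np.sqrt(lam) * gs_c])  # data rows + physics rows
y = np.concatenate([psi_meas, np.zeros(len(R_c))])

w, *_ = np.linalg.lstsq(M, y, rcond=None)  # the whole "real-time training"
print("sensor RMSD:", np.sqrt(np.mean((phi_s @ w - psi_meas) ** 2)))
```

Because only the output weights are free, retraining on each new set of sensor readings reduces to one linear solve, which is what makes real-time adaptation to the evolving equilibrium plausible.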
Why are we recommending this paper?
Due to your Interest in: fusion models
This paper directly addresses plasma boundary reconstruction, a key area within fusion energy research, aligning with your interest in fusion models. The use of physics-informed neural networks is particularly relevant to your interest in convolution and image processing techniques.
University of Bristol
Abstract
Kernel-based methods such as Rocket are among the most effective default approaches for univariate time series classification (TSC), yet they do not perform equally well across all datasets. We revisit the long-standing intuition that different representations capture complementary structure and show that selectively fusing them can yield consistent improvements over Rocket on specific, systematically identifiable kinds of datasets. We introduce Fusion-3 (F3), a lightweight framework that adaptively fuses Rocket, Sax, and Sfa representations. To understand when fusion helps, we cluster UCR datasets into six groups using meta-features capturing series length, spectral structure, roughness, and class imbalance, and treat these clusters as interpretable data-structure regimes. Our analysis shows that fusion typically outperforms strong baselines in regimes with structured variability or rich frequency content, while offering diminishing returns in highly irregular or outlier-heavy settings. To support these findings, we combine three complementary analyses: non-parametric paired statistics across datasets, ablation studies isolating the roles of individual representations, and attribution via SHAP to identify which dataset properties predict fusion gains. Sample-level case studies further reveal the underlying mechanism: fusion primarily improves performance by rescuing specific errors, with adaptive increases in frequency-domain weighting precisely where corrections occur. Using 5-fold cross-validation on the 113 UCR datasets, F3 yields small but consistent average improvements over Rocket, supported by frequentist and Bayesian evidence and accompanied by clearly identifiable failure cases. Our results show that selectively applied fusion provides a dependable and interpretable extension to strong kernel-based methods, correcting their weaknesses precisely where the data support it.
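F3's adaptive weighting scheme is its own; the sketch below shows only the generic shape of such a fusion, with random matrices standing in for the Rocket, Sax, and Sfa feature transforms (in practice these would come from, e.g., sktime or pyts) and cross-validated accuracy used as a stand-in fusion weight.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-ins for the three representations of the same time series.
n = 120
labels = rng.integers(0, 2, n)
reps = {
    "rocket": rng.normal(size=(n, 500)),
    "sax": rng.normal(size=(n, 64)),
    "sfa": rng.normal(size=(n, 64)),
}

# Weight each representation by cross-validated accuracy, then fuse the
# per-representation decision scores with those weights.
weights, scores = {}, {}
for name, X in reps.items():
    clf = RidgeClassifierCV()
    weights[name] = cross_val_score(clf, X, labels, cv=5).mean()
    scores[name] = clf.fit(X, labels).decision_function(X)

fused = sum(weights[k] * scores[k] for k in reps)
preds = (fused > 0).astype(int)
print("weights:", {k: round(v, 3) for k, v in weights.items()})
print("train accuracy of fused scores:", (preds == labels).mean())
```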
Why are we recommending this paper?
Due to your Interest in: fusion models
Given your interest in multimodal models and image recognition, this paper's focus on time series classification using kernel-based methods offers a valuable perspective. The exploration of different representations for capturing structure is directly applicable to your interests.
University of Michigan
AI Insights - Cornserve's planner decides to disaggregate the LLM (1 replica) and the audio generator (7 and 15 replicas on the 8-GPU and 16-GPU cells, respectively) to balance the throughput of each component as much as possible. [3]
- Qwen 3 Omni [45] and Qwen 2.5 Omni [44] are multimodal input & output models that take a combination of text, images, video, and audio as input and generate either text or audio. [3]
- Any-to-Any models: generic models that can handle multiple input and output modalities. [2]
- Cornserve improves the throughput of serving Qwen 2.5 Omni [44] over the baseline monolithic deployment on 8-GPU and 16-GPU cells by 3.09 × and 3.81 ×, respectively. [1]
Abstract
We present Cornserve, an efficient online serving system for an emerging class of multimodal models called Any-to-Any models. Any-to-Any models accept combinations of text and multimodal data (e.g., image, video, audio) as input and also generate combinations of text and multimodal data as output, introducing request type, computation path, and computation scaling heterogeneity in model serving.
Cornserve allows model developers to describe the computation graph of generic Any-to-Any models, which consists of heterogeneous components such as multimodal encoders, autoregressive models like Large Language Models (LLMs), and multimodal generators like Diffusion Transformers (DiTs). Given this, Cornserve's planner automatically finds an optimized deployment plan for the model, including whether and how to disaggregate the model into smaller components based on model and workload characteristics. Cornserve's distributed runtime then executes the model per the plan, efficiently handling Any-to-Any model heterogeneity during online serving. Evaluations show that Cornserve can efficiently serve diverse Any-to-Any models and workloads, delivering up to 3.81$\times$ throughput improvement and up to 5.79$\times$ tail latency reduction over existing solutions.
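As a toy illustration of the planning problem (not Cornserve's actual planner, which weighs many more model and workload characteristics), the greedy allocator below balances per-component throughput under a GPU budget; the per-replica throughput numbers are invented.

```python
# Toy planner: with per-replica throughputs known, assign GPU replicas so
# component throughputs are as balanced as possible, since the end-to-end
# rate of a disaggregated pipeline is capped by its slowest stage.
def plan(throughput_per_replica: dict[str, float], total_gpus: int) -> dict[str, int]:
    replicas = {name: 1 for name in throughput_per_replica}  # at least one each
    for _ in range(total_gpus - len(replicas)):
        # Give the next GPU to the current bottleneck component.
        bottleneck = min(replicas, key=lambda c: replicas[c] * throughput_per_replica[c])
        replicas[bottleneck] += 1
    return replicas

# e.g. an LLM that is much faster per replica than an audio generator
print(plan({"llm": 70.0, "audio_gen": 10.0}, total_gpus=8))
# -> {'llm': 1, 'audio_gen': 7}
```

With these invented numbers, an 8-GPU cell ends up with 1 LLM replica and 7 audio-generator replicas, mirroring the split quoted in the AI Insights above.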
Why are we recommending this paper?
Due to your Interest in: multimodal models
This paper’s focus on ‘Any-to-Any’ multimodal models aligns strongly with your interest in multimodal models and their ability to handle diverse data types. The emphasis on efficient serving systems is a practical consideration for your research.
University of Modena and Reggio Emilia
AI Insights - These models have the potential to improve human-computer interaction, decision-making, and problem-solving in various domains. [3]
- In self-supervised learning, the model learns to predict missing information or to reconstruct the input data. [3]
- The field of multimodal learning is rapidly evolving with the development of new architectures and techniques. [2]
Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLMs training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.
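A minimal PyTorch rendering of the JEPA-style objective the abstract describes, with linear layers standing in for the frozen vision encoders and a small Transformer standing in for the early LLM layers used as predictor; the shapes and masking scheme are assumptions, not the JARVIS code.

```python
import torch
import torch.nn as nn

# Frozen encoders provide context/target features; only the predictor is
# trained to map masked-context features to full target features.
D = 256
context_enc = nn.Linear(768, D).requires_grad_(False)  # stand-in frozen ViT
target_enc = nn.Linear(768, D).requires_grad_(False)   # stand-in frozen ViT
predictor = nn.TransformerEncoder(                     # stands in for early LLM layers
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2
)

patches = torch.randn(4, 196, 768)          # (batch, patch tokens, features)
mask = torch.rand(4, 196) < 0.5             # patches hidden from the context

with torch.no_grad():
    ctx = context_enc(patches) * (~mask).unsqueeze(-1)  # zero out masked tokens
    tgt = target_enc(patches)

pred = predictor(ctx)
loss = (pred - tgt)[mask].pow(2).mean()     # regress targets at masked positions
loss.backward()                             # gradients flow only into predictor
print(float(loss))
```

The key property, matching the abstract, is that supervision comes from image structure itself rather than from text, so the objective can be dropped into a vision-language alignment pipeline without extra language labels.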
Why are we recommending this paper?
Due to your Interest in: multimodal models
This paper tackles the visual reasoning limitations of multimodal large language models, directly addressing your interest in image processing and image recognition. The exploration of self-supervised learning is a key technique for improving visual understanding.
CUHK
AI Insights - The use of visual tools and external knowledge can enhance the reasoning abilities of VLMs. [3]
- Self-reflection and reflection-based reinforcement learning are effective methods for improving the reasoning capabilities of VLMs. [3]
- Curiosity-Driven Reinforcement Learning: A variant of reinforcement learning where the agent is motivated by curiosity rather than a fixed reward signal. [3]
- The development of VLMs has made significant progress in recent years, with many models achieving state-of-the-art performance on various tasks. [3]
- Reinforcement learning is a key technique for improving the reasoning capabilities of vision-language models (VLMs). [2]
Abstract
Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8\% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
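The AT-GRPO reward formulation is given in the paper; the toy function below only illustrates the adaptive idea of scaling reward by a per-sample Tool Benefit Score, so that tool calls pay off only when tools genuinely help. All names and constants here are illustrative, not the paper's.

```python
# Toy reward shaping: scale the reward for a tool-using rollout by how much
# tools actually help on this sample. `tool_benefit` could be estimated as
#   mean(reward | tool rollouts) - mean(reward | no-tool rollouts).
def shaped_reward(correct: bool, used_tool: bool, tool_benefit: float,
                  tool_cost: float = 0.05) -> float:
    base = 1.0 if correct else 0.0
    if used_tool:
        # Amplify reward when tools genuinely help (benefit > 0),
        # and charge a small cost for needless invocations.
        return base * (1.0 + max(tool_benefit, 0.0)) - tool_cost
    # Reward tool-free solutions more when tools would not have helped.
    return base * (1.0 + max(-tool_benefit, 0.0))

print(shaped_reward(correct=True, used_tool=True, tool_benefit=0.3))
print(shaped_reward(correct=True, used_tool=False, tool_benefit=-0.2))
```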
Why are we recommending this paper?
Due to your Interest in: Image Processing
This work's investigation into adaptive tool-use with multimodal large language models is highly relevant to your interests in multimodal models and their interaction with visual tools. The focus on efficient tool invocation aligns with practical considerations for image and video processing.
Carnegie Mellon
Abstract
Images account for a substantial portion of the internet, making efficient compression important for reducing storage and bandwidth demands. This study investigates the use of Singular Value Decomposition and low-rank matrix approximations for image compression, evaluating performance using relative Frobenius error and compression ratio. The approach is applied to both grayscale and multichannel images to assess its generality. Results show that the low-rank approximations often produce images that appear visually similar to the originals, but the compression efficiency remains consistently worse than that of established formats such as JPEG, JPEG2000, and WEBP at comparable error levels. At low tolerated error levels, the compressed representation produced by Singular Value Decomposition can even exceed the size of the original image, indicating that this method is not competitive with industry-standard codecs for practical image compression.
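The method itself is textbook and easy to reproduce. Below is a compact numpy sketch with the two metrics the study uses; a random array stands in for a real image, and a multichannel image would be handled by decomposing each channel separately.

```python
import numpy as np

# Rank-k SVD approximation of a grayscale image, with relative Frobenius
# error and compression ratio (original values vs. stored U_k, s_k, V_k).
def svd_compress(img: np.ndarray, k: int):
    U, s, Vt = np.linalg.svd(img.astype(float), full_matrices=False)
    approx = U[:, :k] * s[:k] @ Vt[:k]          # rank-k reconstruction
    rel_err = np.linalg.norm(img - approx) / np.linalg.norm(img)
    m, n = img.shape
    ratio = (m * n) / (k * (m + n + 1))         # >1 means actual compression
    return approx, rel_err, ratio

img = np.random.rand(256, 256)                  # stand-in for a real image
for k in (5, 20, 80):
    _, err, ratio = svd_compress(img, k)
    print(f"k={k:3d}  relative Frobenius error={err:.3f}  ratio={ratio:.2f}")
```

Note that the stored size $k(m + n + 1)$ exceeds $mn$ once $k$ grows large, which is exactly the regime where the abstract reports the compressed representation growing beyond the original image.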
Why are we recommending this paper?
Due to your Interest in: Image Processing
Kyoto University
Abstract
We show that the distribution of the spectral maximum of monotonically independent self-adjoint operators coincides with the classical max-convolution of their distributions. In free probability, it was proven that for any probability measures $\sigma, \mu$ on $\mathbb{R}$ there is a unique probability measure $\mathbb{A}_\sigma(\mu)$ satisfying $\sigma \boxplus \mu = \sigma \triangleright \mathbb{A}_\sigma(\mu)$, where $\boxplus$ and $\triangleright$ are the free and monotone additive convolutions, respectively. We recall that the reciprocal Cauchy transform of $\mathbb{A}_\sigma(\mu)$ is the subordination function for free additive convolution. Motivated by this analogy, we introduce subordination functions for free max-convolution and prove their existence and structural properties.
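For readers new to max-convolution, the classical operation the abstract refers to is simply this: for independent random variables $X \sim \sigma$ and $Y \sim \mu$ on $\mathbb{R}$, the max-convolution is the law of $\max(X, Y)$, whose distribution function factorizes as $F_{\sigma \vee \mu}(x) = \Pr[\max(X, Y) \le x] = \Pr[X \le x]\,\Pr[Y \le x] = F_\sigma(x)\,F_\mu(x)$. The paper's first result states that the spectral maximum of monotonically independent self-adjoint operators is distributed in exactly this way.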
AI Insights - Boolean convolution: a convolution operation in free probability theory that combines two measures using the Boolean product (R. Speicher and R. Woroudi, Boolean convolution, in Free Probability Theory (Waterloo, ON, 1995), Fields Institute Communications). [3]
- The paper discusses the relationship between monotone additive convolution and max-convolution in free probability theory. [2]
Why are we recommending this paper?
Due to your Interest in: convolution
Samsung Research
Abstract
Accurately learning high-frequency signals is a challenge in computer vision and graphics, as neural networks often struggle with these signals due to spectral bias or optimization difficulties. While current techniques like Fourier encodings have made great strides in improving performance, there remains scope for improvement when presented with high-frequency information. This paper introduces Queried-Convolutions (Qonvolutions), a simple yet powerful modification using the neighborhood properties of convolution. Qonvolution convolves a low-frequency signal with queries (such as coordinates) to enhance the learning of intricate high-frequency signals. We empirically demonstrate that Qonvolutions enhance performance across a variety of high-frequency learning tasks crucial to both the computer vision and graphics communities, including 1D regression, 2D super-resolution, 2D image regression, and novel view synthesis (NVS). In particular, by combining Gaussian splatting with Qonvolutions for NVS, we showcase state-of-the-art performance on real-world complex scenes, even outperforming powerful radiance field models on image quality.
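The exact Qonvolution operator is defined in the paper; as a rough, assumption-laden sketch of the flavor, the layer below concatenates a per-pixel coordinate query grid with a low-frequency feature map before a standard convolution (essentially a CoordConv-style stand-in), so the kernel can produce position-dependent detail.

```python
import torch
import torch.nn as nn

# Query-conditioned convolution sketch: local features are convolved together
# with per-pixel coordinate queries. This is an assumption-laden stand-in,
# not the paper's Qonvolution operator.
class QueryConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 2, out_ch, k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        q = torch.stack([gy, gx]).expand(b, -1, -1, -1)  # coordinate queries
        return self.conv(torch.cat([x, q], dim=1))

lowfreq = torch.randn(1, 16, 32, 32)      # smooth, low-frequency features
print(QueryConv2d(16, 3)(lowfreq).shape)  # -> torch.Size([1, 3, 32, 32])
```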
Why are we recommending this paper?
Due to your Interest in: convolution
University of Southampton
Abstract
A complex system comprises multiple interacting entities whose interdependencies form a unified whole, exhibiting emergent behaviours not present in individual components. Examples include the human brain, living cells, soft matter, Earth's climate, ecosystems, and the economy. These systems exhibit high-dimensional, non-linear dynamics, making their modelling, classification, and prediction particularly challenging. Advances in information technology have enabled data-driven approaches to studying such systems. However, the sheer volume and complexity of spatio-temporal data often hinder traditional methods like dimensionality reduction, phase-space reconstruction, and attractor characterisation. This paper introduces a geometric framework for analysing spatio-temporal data from complex systems, grounded in the theory of vector fields over discrete measure spaces. We propose a two-parameter family of metrics suitable for data analysis and machine learning applications. The framework supports time-dependent images, image gradients, and real- or vector-valued functions defined on graphs and simplicial complexes. We validate our approach using data from numerical simulations of biological and physical systems on flat and curved domains. Our results show that the proposed metrics, combined with multidimensional scaling, effectively address key analytical challenges. They enable dimensionality reduction, mode decomposition, phase-space reconstruction, and attractor characterisation. Our findings offer a robust pathway for understanding complex dynamical systems, especially in contexts where traditional modelling is impractical but abundant experimental data are available.
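As a small illustration of the pipeline's last step, the sketch below embeds time snapshots of a field by their pairwise distances using multidimensional scaling; a plain Euclidean metric stands in for the paper's two-parameter metric family, and the travelling-wave data are synthetic.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
T, N = 100, 400                          # time steps, grid points
t = np.linspace(0, 4 * np.pi, T)[:, None]
x = np.linspace(0, 2 * np.pi, N)[None, :]
field = np.sin(x - t) + 0.05 * rng.normal(size=(T, N))  # travelling wave

# Pairwise distances between snapshots (rows), then a 2-D MDS embedding;
# the paper's metrics would replace this plain Euclidean distance.
D = np.linalg.norm(field[:, None, :] - field[None, :, :], axis=-1)
emb = MDS(n_components=2, dissimilarity="precomputed",
          random_state=0).fit_transform(D)
print(emb.shape)  # (100, 2): a periodic travelling wave traces a closed loop
```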
AI Insights - The method can be used to analyze high-dimensional unstructured data sets, including RGB images with high resolution. [3]
- The paper presents a geometric framework for analyzing high-dimensional spatio-temporal data from complex systems. [2]
- The approach is demonstrated on two case studies: the Ginzburg-Landau equation and the Gray-Scott equation. [1]
Why are we recommending this paper?
Due to your Interest in: Image Recognition
Friedrich-Alexander-Universität Erlangen-Nürnberg
Abstract
Recent advances in generative image modeling have achieved visual realism sufficient to deceive human experts, yet their potential for privacy preserving data sharing remains insufficiently understood. A central obstacle is the absence of reliable memorization detection mechanisms, limited quantitative evaluation, and poor generalization of existing privacy auditing methods across domains. To address this, we propose to view memorization detection as a unified problem at the intersection of re-identification and copy detection, whose complementary goals cover both identity consistency and augmentation-robust duplication, and introduce Latent Contrastive Memorization Network (LCMem), a cross-domain model evaluated jointly on both tasks. LCMem achieves this through a two-stage training strategy that first learns identity consistency before incorporating augmentation-robust copy detection. Across six benchmark datasets, LCMem achieves improvements of up to 16 percentage points on re-identification and 30 percentage points on copy detection, enabling substantially more reliable memorization detection at scale. Our results show that existing privacy filters provide limited performance and robustness, highlighting the need for stronger protection mechanisms. We show that LCMem sets a new standard for cross-domain privacy auditing, offering reliable and scalable memorization detection. Code and model are publicly available at https://github.com/MischaD/LCMem.
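The LCMem architecture and losses live in the linked repository; the miniature below only illustrates the two-stage contrastive recipe the abstract outlines, with a trivial encoder, same-identity views for stage one, and augmented (flipped) copies for stage two. Everything here is a stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stage 1 pulls together embeddings of the same identity; stage 2 adds
# augmented copies so duplicates survive transformations.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Contrastive loss: a[i] should match b[i] against all other b[j]."""
    logits = F.normalize(a, dim=1) @ F.normalize(b, dim=1).T / tau
    return F.cross_entropy(logits, torch.arange(len(a)))

imgs = torch.rand(16, 3, 64, 64)
same_id = imgs + 0.01 * torch.randn_like(imgs)  # stage 1: same-identity views
augmented = torch.flip(imgs, dims=[-1])         # stage 2: flipped copies

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for views in (same_id, augmented):              # stage 1, then stage 2
    loss = info_nce(encoder(imgs), encoder(views))
    opt.zero_grad(); loss.backward(); opt.step()
    print(float(loss))
```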
Why are we recommending this paper?
Due to your Interest in: Image Recognition