Hi!

Your personalized paper recommendations for 24–28 November 2025.
🎯 Top Personalized Recommendations
Abstract
While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions). Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps on the world-centric types. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with a visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.
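The abstract does not spell out the visual knowledge reward, so the following is only a rough Python sketch of how a structured See-Think-Answer response might be scored with a composite reward for RL fine-tuning; the <see>/<think>/<answer> tags, the weights, and the keyword-matching heuristic are all assumptions rather than the paper's recipe.

```python
import re

def composite_reward(response: str, gold_answer: str, knowledge_keywords: list,
                     w_format: float = 0.2, w_answer: float = 0.6, w_knowledge: float = 0.2) -> float:
    """Illustrative reward for a See-Think-Answer style response.

    Assumes the model emits <see>...</see><think>...</think><answer>...</answer>;
    tag names, weights, and the keyword heuristic are hypothetical, not from the paper.
    """
    # Format term: all three sections must be present.
    sections = {tag: re.search(rf"<{tag}>(.*?)</{tag}>", response, re.S)
                for tag in ("see", "think", "answer")}
    format_ok = all(m is not None for m in sections.values())

    # Answer term: exact match against the reference answer.
    answer_text = sections["answer"].group(1).strip() if sections["answer"] else ""
    answer_ok = answer_text.lower() == gold_answer.strip().lower()

    # Visual-knowledge term: fraction of expected knowledge keywords mentioned in <see>.
    see_text = sections["see"].group(1).lower() if sections["see"] else ""
    hits = sum(kw.lower() in see_text for kw in knowledge_keywords)
    knowledge_score = hits / len(knowledge_keywords) if knowledge_keywords else 0.0

    return w_format * format_ok + w_answer * answer_ok + w_knowledge * knowledge_score
```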
Why we think this paper is great for you:
This paper directly addresses the evaluation of visual understanding in advanced language models, which is highly relevant to your interest in combining different data modalities. It explores how these models interpret complex visual information, a key area of research.
Abstract
Sparse Mixture-of-Experts (SMoE) architectures have enabled a new frontier in scaling Large Language Models (LLMs), offering superior performance by activating only a fraction of their total parameters during inference. However, their practical deployment is severely hampered by substantial static memory overhead, as all experts must be loaded into memory. Existing post-training pruning methods, while reducing model size, often derive their pruning criteria from a single, general-purpose corpus. This leads to a critical limitation: a catastrophic performance degradation when the pruned model is applied to other domains, necessitating a costly re-pruning for each new domain. To address this generalization gap, we introduce Mosaic Pruning (MoP). The core idea of MoP is to construct a functionally comprehensive set of experts through a structured "cluster-then-select" process. This process leverages a similarity metric that captures expert performance across different task domains to functionally cluster the experts, and subsequently selects the most representative expert from each cluster based on our proposed Activation Variability Score. Unlike methods that optimize for a single corpus, our proposed Mosaic Pruning ensures that the pruned model retains a functionally complementary set of experts, much like the tiles of a mosaic that together form a complete picture of the original model's capabilities, enabling it to handle diverse downstream tasks. Extensive experiments on various MoE models demonstrate the superiority of our approach. MoP significantly outperforms prior work, achieving a 7.24% gain on general tasks and 8.92% on specialized tasks like math reasoning and code generation.
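As a rough illustration of the "cluster-then-select" idea, the sketch below groups experts by their cross-domain activation profiles and keeps one representative per cluster; the k-means clustering and the variance-based selection score are stand-ins for the paper's similarity metric and Activation Variability Score, which are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def mosaic_prune(activation_profiles: np.ndarray, n_keep: int) -> list:
    """Toy 'cluster-then-select' pruning of MoE experts.

    activation_profiles: (n_experts, n_domains) array, e.g. how often each expert is
    routed to on probe data from each task domain. The clustering and the variance-based
    score below are placeholders for the paper's metric and Activation Variability Score.
    """
    # Group functionally similar experts by their cross-domain activation patterns.
    labels = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit_predict(activation_profiles)

    kept = []
    for c in range(n_keep):
        members = np.flatnonzero(labels == c)
        # Keep the member whose activation varies most across domains,
        # a crude proxy for the most informative representative of the cluster.
        scores = activation_profiles[members].var(axis=1)
        kept.append(int(members[np.argmax(scores)]))
    return sorted(kept)

# Example: prune 8 experts down to 3 using random probe statistics.
profiles = np.random.rand(8, 4)
print(mosaic_prune(profiles, n_keep=3))
```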
Why we think this paper is great for you:
You will find this paper particularly interesting as it delves into optimizing large-scale models by efficiently pruning expert networks. It directly combines your interests in specialized architectures and efficient model deployment.
AI Summary
  • The Dynamic MoE approach effectively addresses the loss of plasticity problem in neural networks by dynamically adding new experts to the MoE layer. [3]
  • Dynamic MoE outperforms prior network expansion methods in both synthetic continual learning and open-world environment settings, with a significant improvement in performance when compared to baseline models. [3]
  • The router weight visualization reveals that distinct distributions are allocated to newly added experts, highlighting the potential of Dynamic MoE to maintain plasticity. [3]
  • Mixture-of-Experts (MoE) architecture: a type of neural network architecture that specializes experts for distinct distributions. [3]
  • Dynamic MoE: a variant of the MoE architecture that dynamically adds new experts to the MoE layer, allowing for continuous learning and adaptation. [3]
  • Plasticity-stability dilemma: the trade-off between maintaining plasticity (the ability to learn and adapt) and stability (the ability to retain previously learned knowledge). [3]
  • Dynamic MoE is a promising approach for addressing the loss of plasticity problem in neural networks, with significant improvements in performance compared to prior network expansion methods. [3]
  • The capacity of Dynamic MoE is not wasted on overfitting early distributions, allowing it to maintain plasticity even in continually shifting environments. [2]
Abstract
The challenge of building neural networks that can continuously learn and adapt to evolving data streams is central to the fields of continual learning (CL) and reinforcement learning (RL). This lifelong learning problem is often framed in terms of the plasticity-stability dilemma, focusing on issues like loss of plasticity and catastrophic forgetting. Unlike neural networks, biological brains maintain plasticity through capacity growth, inspiring researchers to explore similar approaches in artificial networks, such as adding capacity dynamically. Prior solutions often lack parameter efficiency or depend on explicit task indices, but Mixture-of-Experts (MoE) architectures offer a promising alternative by specializing experts for distinct distributions. This paper aims to evaluate a Dynamic MoE approach for continual and reinforcement learning environments and benchmark its effectiveness against existing network expansion methods.
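To make the capacity-growth idea concrete, here is a toy sketch of appending an expert to an MoE layer and widening its router while keeping the existing routing weights; the layer layout, dense (softmax) routing, and initialization are generic assumptions, not the paper's Dynamic MoE implementation.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal MoE layer used only to illustrate dynamic expert addition."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)

    def add_expert(self):
        """Grow capacity: append a fresh expert and widen the router's output."""
        d_model = self.router.in_features
        self.experts.append(nn.Linear(d_model, d_model))
        old = self.router
        new = nn.Linear(d_model, len(self.experts))
        with torch.no_grad():
            new.weight[: old.out_features].copy_(old.weight)   # keep old routing logits
            new.bias[: old.out_features].copy_(old.bias)
        self.router = new

    def forward(self, x):
        weights = torch.softmax(self.router(x), dim=-1)            # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, d_model, n_experts)
        return (outs * weights.unsqueeze(1)).sum(dim=-1)

moe = TinyMoE(d_model=16, n_experts=2)
moe.add_expert()                       # e.g. triggered when a distribution shift is detected
print(moe(torch.randn(4, 16)).shape)   # torch.Size([4, 16])
```

In a continual-learning loop, add_expert() would be called when a new distribution appears, so fresh capacity serves the new data instead of overwriting experts tuned to earlier distributions.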
Why we think this paper is great for you:
This work explores how models can adapt to changing data environments using specialized architectures, which aligns perfectly with your focus on robust and adaptable deep learning systems. It offers insights into building resilient models.
Abstract
Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning -- only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.
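The sampling procedure can be sketched in a few lines: run a standard denoising loop, but query the image-quality expert at high noise levels and hand over to the likelihood expert below the switching step. The (x, t) -> noise-prediction interface and the generic step_fn solver update below are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def switched_sampler(quality_expert, likelihood_expert, x_T, timesteps, switch_step, step_fn):
    """Plug-and-play sampling with two pretrained diffusion experts.

    quality_expert / likelihood_expert: callables (x_t, t) -> predicted noise.
    timesteps: decreasing noise levels (high noise -> low noise).
    switch_step: index in `timesteps` at which to hand over to the likelihood expert.
    step_fn: one solver update (x_t, eps_pred, t) -> x at the next level.
    These interfaces are placeholders, not the paper's API.
    """
    x = x_T
    for i, t in enumerate(timesteps):
        expert = quality_expert if i < switch_step else likelihood_expert
        eps = expert(x, t)          # quality expert shapes global structure at high noise,
        x = step_fn(x, eps, t)      # likelihood expert refines pixel statistics at low noise
    return x
```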
Why we think this paper is great for you:
This paper presents an innovative approach to enhance generative models by combining specialized components, directly connecting with your interest in advanced generative techniques and modular architectures. It offers a unique perspective on improving model performance.
AI Summary
  • An optimised deep learning model for dynamic market behaviour prediction has been proposed. [3]
  • The model performs better at capturing market complexity and uncertainty than traditional methods such as ARIMA, SVR, and random forest. [3]
  • The integration of advanced neural network architecture and reinforcement learning methods enables the model to demonstrate enhanced efficiency in resource allocation and profit maximisation. [3]
  • Further research should concentrate on enhancing the scalability and interpretability of the models in order to facilitate their use with larger and more diverse datasets. [3]
  • Entropy: A measure of the complexity and uncertainty inherent in market behaviour. [3]
  • Learning Rate: The step size applied to parameter updates during training, which governs how quickly the model converges. [3]
  • The proposed deep learning model has been shown to outperform traditional methods such as ARIMA, SVR, and random forest in terms of predictive power. [3]
  • The integration of sophisticated reinforcement learning techniques, such as multi-agent systems or meta-learning, is anticipated to enhance the models' capacity to adapt to the dynamic nature of evolving markets. [2]
  • Maximum Iterations: The maximum number of iterations allowed for the model to converge. [1]
Abstract
The advent of financial technology has witnessed a surge in the utilization of deep learning models to anticipate consumer conduct, a trend that has demonstrated considerable potential in enhancing lending strategies and bolstering market efficiency. We study multi-horizon demand forecasting on e-commerce transactions using the UCI Online Retail II dataset. Unlike prior versions of this manuscript that mixed financial-loan narratives with retail data, we focus exclusively on retail market behavior and define a clear prediction target: per SKU daily demand (or revenue) for horizons H=1,7,14. We present a hybrid sequence model that combines multi-scale temporal convolutions, a gated recurrent module, and time-aware self-attention. The model is trained with standard regression losses and evaluated under MAE, RMSE, sMAPE, MASE, and Theil's U_2 with strict time-based splits to prevent leakage. We benchmark against ARIMA/Prophet, LSTM/GRU, LightGBM, and state-of-the-art Transformer forecasters (TFT, Informer, Autoformer, N-BEATS). Results show consistent accuracy gains and improved robustness on peak/holiday periods. We further provide ablations and statistical significance tests to ensure the reliability of improvements, and we release implementation details to facilitate reproducibility.
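To make the hybrid architecture concrete, here is a minimal PyTorch sketch combining multi-scale temporal convolutions, a GRU, and self-attention with a multi-horizon output head; all layer sizes and kernel widths are guesses, and the paper's time-aware attention (which would inject explicit time features) is reduced to plain self-attention here.

```python
import torch
import torch.nn as nn

class HybridForecaster(nn.Module):
    """Rough sketch of a conv + GRU + self-attention demand forecaster."""
    def __init__(self, n_features: int, d_model: int = 64, horizons: int = 3):
        super().__init__()
        # Multi-scale temporal convolutions over the input sequence.
        self.convs = nn.ModuleList([
            nn.Conv1d(n_features, d_model, kernel_size=k, padding=k // 2) for k in (3, 7, 15)
        ])
        self.gru = nn.GRU(3 * d_model, d_model, batch_first=True)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, horizons)   # e.g. horizons H = 1, 7, 14

    def forward(self, x):                       # x: (batch, seq_len, n_features)
        c = x.transpose(1, 2)                   # (batch, n_features, seq_len)
        feats = torch.cat([conv(c) for conv in self.convs], dim=1).transpose(1, 2)
        h, _ = self.gru(feats)                  # (batch, seq_len, d_model)
        a, _ = self.attn(h, h, h)               # self-attention across the sequence
        return self.head(a[:, -1])              # one forecast per horizon

model = HybridForecaster(n_features=8)
print(model(torch.randn(2, 30, 8)).shape)       # torch.Size([2, 3])
```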
Why we think this paper is great for you:
This paper focuses on refining the performance of predictive models, which is directly applicable to your interest in enhancing deep learning systems. It provides practical insights into improving model efficiency and accuracy in real-world scenarios.
AI Summary
  • Some notable advancements include progressive distillation, test-time scaling, and inference-time steering. [3]
  • Diffusion model: A type of generative model that learns to sample data from a probability distribution by iteratively refining an initial noise signal. [3]
  • Score-based generative modeling: A technique for learning the underlying probability distribution of data using stochastic differential equations. [3]
  • PixelCNN decoder: A type of neural network architecture used for image generation and processing. [3]
  • Diffusion models can be computationally expensive and require large amounts of memory. [3]
  • Diffusion models have shown great promise in image synthesis tasks, but there is still room for improvement in terms of efficiency and quality. [2]
  • Diffusion models have become a popular choice for image synthesis tasks. [1]
Abstract
Diffusion models have become the dominant paradigm in text-to-image generation, and test-time scaling (TTS) further improves quality by allocating more computation during inference. However, existing TTS methods operate at the full-image level, overlooking the fact that image quality is often spatially heterogeneous. This leads to unnecessary computation on already satisfactory regions and insufficient correction of localized defects. In this paper, we explore a new direction - Localized TTS - that adaptively resamples defective regions while preserving high-quality regions, thereby substantially reducing the search space. This paradigm poses two central challenges: accurately localizing defects and maintaining global consistency. We propose LoTTS, the first fully training-free framework for localized TTS. For defect localization, LoTTS contrasts cross- and self-attention signals under quality-aware prompts (e.g., high-quality vs. low-quality) to identify defective regions, and then refines them into coherent masks. For consistency, LoTTS perturbs only defective regions and denoises them locally, ensuring that corrections remain confined while the rest of the image remains undisturbed. Extensive experiments on SD2.1, SDXL, and FLUX demonstrate that LoTTS achieves state-of-the-art performance: it consistently improves both local quality and global fidelity, while reducing GPU cost by 2-4x compared to Best-of-N sampling. These findings establish localized TTS as a promising new direction for scaling diffusion models at inference time.
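A minimal sketch of the "perturb and locally denoise" step is given below, written in the style of mask-based inpainting; the denoise_step and add_noise scheduler interfaces are placeholders, and the attention-contrast localization that would produce defect_mask is not reproduced.

```python
import torch

@torch.no_grad()
def localized_resample(denoise_step, add_noise, x0, defect_mask, timesteps):
    """Resample only defective regions of a generated image (inpainting-style sketch).

    denoise_step(x_t, t) -> sample at the next (lower) noise level; add_noise(x0, t) -> x_t.
    Both are placeholders for whatever scheduler API is in use. defect_mask is 1 on
    defective pixels and 0 elsewhere; timesteps is a decreasing sequence of noise levels.
    """
    x = add_noise(x0, timesteps[0])                      # perturb only up to a mid noise level
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        x = denoise_step(x, t_cur)                       # move from level t_cur to t_next
        known = add_noise(x0, t_next)                    # clean content re-noised to t_next
        x = defect_mask * x + (1 - defect_mask) * known  # keep everything outside the mask
    x = denoise_step(x, timesteps[-1])                   # final step to a clean image
    return defect_mask * x + (1 - defect_mask) * x0
```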
Why we think this paper is great for you:
This paper offers a novel method to improve generative model quality without extensive retraining, aligning with your interest in efficient and effective generative techniques. It explores how to optimize these models for better output.
AI Summary
  • Task Arithmetic is the only method that reliably produces constructive interference in LLMs, improving upon both the base model and all individual checkpoints. [3]
  • The lack of orthogonal task structure assumed by subspace-boosting methods is a key factor in their failure. [3]
  • LLMs: Large Language Models. [3]
  • Constructive interference: when the merged model performs better than both the base model and all individual checkpoints. [3]
  • Model merging techniques for large language models (LLMs) are not universally reliable and should be applied cautiously. [2]
Abstract
Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
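For reference, Task Arithmetic itself is simple to state: form task vectors as differences between fine-tuned and base weights, sum them, and add the scaled sum back to the base. The sketch below assumes floating-point state dicts with matching keys; the scaling coefficient lam is a commonly used default, not a value from this paper.

```python
import torch

def task_arithmetic_merge(base_state, finetuned_states, lam: float = 0.3):
    """Merge fine-tuned checkpoints via Task Arithmetic:
    theta_merged = theta_base + lam * sum_i (theta_i - theta_base).

    Assumes all state dicts share keys and shapes and contain floating-point tensors;
    lam is typically tuned on a validation set.
    """
    merged = {}
    for name, base_param in base_state.items():
        task_vectors = [ft[name] - base_param for ft in finetuned_states]
        merged[name] = base_param + lam * torch.stack(task_vectors).sum(dim=0)
    return merged

# Usage sketch:
# merged_state = task_arithmetic_merge(base.state_dict(), [m1.state_dict(), m2.state_dict()])
# base.load_state_dict(merged_state)
```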
Why we think this paper is great for you:
This paper investigates methods for combining different model versions to achieve better performance, which is highly relevant to your work with advanced language models. It provides a comprehensive look at improving these powerful systems.
Deep Learning Architectures
Abstract
This paper argues that DNNs implement a computational Occam's razor -- finding the 'simplest' algorithm that fits the data -- and that this could explain their incredible and wide-ranging success over more traditional statistical methods. We start with the discovery that the set of real-valued functions $f$ that can be $\varepsilon$-approximated with a binary circuit of size at most $c\varepsilon^{-\gamma}$ becomes convex in the 'Harder than Monte Carlo' (HTMC) regime, when $\gamma>2$, allowing for the definition of an HTMC norm on functions. In parallel, one can define a complexity measure on the parameters of a ResNet (a weighted $\ell_1$ norm of the parameters), which induces a 'ResNet norm' on functions. The HTMC and ResNet norms can then be related by an almost matching sandwich bound. Thus minimizing this ResNet norm is equivalent to finding a circuit that fits the data with an almost minimal number of nodes (within a power of 2 of being optimal). ResNets thus appear as an alternative model for computation of real functions, better adapted to the HTMC regime and its convexity.
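The parameter-side complexity measure is described as a weighted $\ell_1$ norm of a ResNet's parameters; a rough stand-in is sketched below, with the per-layer weighting left as a placeholder (layer_weight) since the paper's exact weights are not given in the abstract.

```python
import torch
import torch.nn as nn

def weighted_l1_norm(model: nn.Module, layer_weight=lambda depth: 1.0) -> torch.Tensor:
    """Weighted ell_1 norm of a network's parameters, a rough stand-in for the
    paper's 'ResNet norm'. The per-parameter weighting is a placeholder here
    (constant by default), not the weighting defined in the paper.
    """
    total = torch.zeros(())
    for depth, (_, param) in enumerate(model.named_parameters()):
        total = total + layer_weight(depth) * param.abs().sum()
    return total

# Example use as a complexity regularizer during training:
# loss = task_loss + 1e-4 * weighted_l1_norm(model)
```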
AI Summary
  • The HTMC norm is a measure of the complexity of a function, and it has several useful properties, including compositionality and convexity. [3]
  • The construction of the ResNet involves two main parts: first, the input is mapped to the weighted binary representations of its surrounding vertices; second, a sorting algorithm is used to recover the simplex that contains the input. [3]
  • The Lipschitz constant of this network admits an explicit bound in terms of the output dimension $d_{\mathrm{out}}$ and the circuit size $|C|$. [3]
  • The ResNet representation of Tetrakis functions has several useful properties, including compositionality and convexity. [3]
  • HTMC norm: a measure of the complexity of a function. [3]
  • Hölder continuous: a property of a function that implies it can be represented as a sum of Tetrakis functions. [3]
  • Tetrakis function: a type of function that is both HTMC computable and Hölder continuous. [3]
  • ResNet: a type of neural network that can represent functions that are both HTMC computable and Hölder continuous. [3]
  • ResNets can be used to represent functions that are both HTMC computable and Hölder continuous. [2]
Abstract
In recent years, machine learning and deep learning have driven advances in domains such as image classification, speech recognition, and anomaly detection by leveraging multi-layer neural networks to model complex data. Simultaneously, quantum computing (QC) promises to address classically intractable problems via quantum parallelism, motivating research in quantum machine learning (QML). Among QML techniques, quantum autoencoders show promise for compressing high-dimensional quantum and classical data. However, designing effective quantum circuit architectures for quantum autoencoders remains challenging due to the complexity of selecting gates, arranging circuit layers, and tuning parameters. This paper proposes a neural architecture search (NAS) framework that automates the design of quantum autoencoders using a genetic algorithm (GA). By systematically evolving variational quantum circuit (VQC) configurations, our method seeks to identify high-performing hybrid quantum-classical autoencoders for data reconstruction without becoming trapped in local minima. We demonstrate effectiveness on image datasets, highlighting the potential of quantum autoencoders for efficient feature extraction within a noise-prone, near-term quantum era. Our approach lays a foundation for broader application of genetic algorithms to quantum architecture search, aiming for a robust, automated method that can adapt to varied data and hardware constraints.
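A generic genetic-algorithm loop over circuit layouts might look like the sketch below; the toy gate set, genome encoding, and selection/mutation scheme are simplified assumptions, and the evaluate callback is where the actual variational-circuit training and reconstruction scoring would live.

```python
import random

GATES = ["rx", "ry", "rz", "cnot"]   # toy gate alphabet; the real search space is richer

def random_genome(n_layers: int = 6):
    """A genome is just a sequence of gate choices, one per circuit layer."""
    return [random.choice(GATES) for _ in range(n_layers)]

def evolve(evaluate, pop_size=20, n_generations=30, mutation_rate=0.2):
    """Toy genetic search over circuit layouts for a quantum autoencoder.

    evaluate(genome) -> float should return a fitness such as negative reconstruction
    error from training the corresponding variational circuit; it is supplied by the
    caller. Truncation selection, one-point crossover, and per-gene mutation are
    generic GA choices, not the paper's.
    """
    population = [random_genome() for _ in range(pop_size)]
    for _ in range(n_generations):
        scored = sorted(population, key=evaluate, reverse=True)
        parents = scored[: pop_size // 2]                       # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))
            child = a[:cut] + b[cut:]                           # one-point crossover
            child = [random.choice(GATES) if random.random() < mutation_rate else g
                     for g in child]                            # per-gene mutation
            children.append(child)
        population = parents + children
    return max(population, key=evaluate)

# best = evolve(lambda genome: -reconstruction_error(genome))   # evaluator is user-defined
```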
AI Summary
  • The field of quantum machine learning is rapidly evolving with new techniques and methods being developed to improve the performance of quantum neural networks. [3]
  • Quantum reinforcement learning has been gaining attention in recent years due to its potential applications in various fields such as finance, healthcare, and transportation. [3]
  • Quantum autoencoders have shown promise in denoising quantum data and reducing the dimensionality of high-dimensional quantum systems. [3]
  • The use of genetic algorithms and other optimization techniques is becoming increasingly popular in the field of quantum machine learning. [3]
  • Quantum Machine Learning: A subfield of artificial intelligence that uses quantum computing to improve the performance of machine learning models. [3]
  • Quantum Reinforcement Learning: A type of reinforcement learning where an agent learns to make decisions by interacting with a quantum environment. [3]
  • Quantum Autoencoders: Neural networks that learn to compress and reconstruct high-dimensional quantum data. [3]
  • Differentiable quantum architecture search (QAS) has emerged as a promising approach for designing efficient quantum circuits. [1]
Deep Learning
Abstract
In a study published in \emph{Nature}, researchers from DeepMind and mathematicians demonstrated a general framework using machine learning to make conjectures in pure mathematics. Their work uses neural networks and attribution techniques to guide human intuition towards making provable conjectures. Here, we build upon this framework to develop a method for identifying sufficient conditions that imply a given mathematical statement. Our approach trains neural networks with a custom loss function that prioritizes high precision, then uses attribution techniques and exploratory data analysis to make conjectures. As a demonstration, we apply this process to Stanley's problem of $e$-positivity of graphs--a problem that has been at the center of algebraic combinatorics for the past three decades. Guided by AI, we rediscover that one sufficient condition for a graph to be $e$-positive is that it is co-triangle-free, and that the number of claws is the most important factor for $e$-positivity. Based on the most important factors in Saliency Map analysis of neural networks, we suggest that the classification of $e$-positive graphs is more closely related to continuous graph invariants than to discrete ones. Furthermore, using neural networks and exploratory data analysis, we show that the claw-free and claw-contractible-free graphs with $10$ and $11$ vertices are $e$-positive, resolving a conjecture by Dahlberg, Foley, and van Willigenburg.
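The two ingredients named in the abstract, a precision-oriented loss and attribution via saliency maps, can be sketched as follows; weighting false positives more heavily is one assumed way to prioritize precision, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def precision_weighted_bce(logits, targets, fp_weight: float = 5.0):
    """Binary cross-entropy that charges more for false positives, pushing the
    classifier toward high precision. The weighting scheme is an assumption; the
    paper only states that its loss prioritizes precision. `targets` is a float
    tensor of 0/1 labels with the same shape as `logits`.
    """
    weights = torch.where(targets == 0,
                          torch.full_like(targets, fp_weight),
                          torch.ones_like(targets))
    return F.binary_cross_entropy_with_logits(logits, targets, weight=weights)

def saliency(model, x):
    """Mean absolute input gradient per feature, a simple saliency-map attribution."""
    x = x.clone().requires_grad_(True)
    model(x).sum().backward()
    return x.grad.abs().mean(dim=0)
```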
AI Summary
  • The authors used a precision-optimized model to identify the top four features that impact e-positivity in graphs. [3]
  • The model achieved 100% precision on the test set, indicating high reliability for its positive predictions. [3]
  • The study demonstrates how AI can be used to guide human intuition and advance mathematics by identifying underlying patterns associated with e-positivity. [3]
  • The authors' approach can be applied to other areas of mathematics where pattern recognition is crucial. [3]
  • E-positivity: the property that a graph's chromatic symmetric function expands with nonnegative coefficients in the basis of elementary symmetric functions. [3]
  • Chromatic symmetric function: a polynomial invariant of graphs that encodes information about their coloring properties. [3]
  • The study demonstrates the potential of AI in advancing mathematics by identifying underlying patterns associated with e-positivity. [3]
  • The precision-optimized model achieved high reliability for its positive predictions, indicating that it can be trusted with high confidence when classifying graphs as e-positive. [3]
  • The approach used in this study can be applied to other areas of mathematics where pattern recognition is crucial. [3]
  • Saliency Map analysis: a technique used to identify the most important features or variables in a dataset by computing the average gradient of the model's output with respect to its input features. [2]
Multimodal Learning
Abstract
New educational models such as smart learning environments use digital and context-aware devices to facilitate the learning process. In this new educational scenario, a huge quantity of multimodal student data from a variety of different sources can be captured, fused, and analyzed. This offers researchers and educators a unique opportunity to discover new knowledge, better understand the learning process, and intervene if necessary. However, data fusion approaches and techniques must be applied correctly in order to combine various sources of multimodal learning analytics (MLA). These sources or modalities in MLA include audio, video, electrodermal activity data, eye-tracking, user logs, and click-stream data, but also learning artifacts and more natural human signals such as gestures, gaze, speech, or writing. This survey introduces data fusion in learning analytics (LA) and educational data mining (EDM) and how these data fusion techniques have been applied in smart learning. It shows the current state of the art by reviewing the main publications, the main types of fused educational data, and the data fusion approaches and techniques used in EDM/LA, as well as the main open problems, trends, and challenges in this specific research area.
Large Language Models
Abstract
We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14–18% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5%, 17.2%, and 18.5% fewer steps, respectively, to reach a fixed validation loss, and 13.7%, 12.7%, and 13.7% lower inference latency, together with a 16% throughput gain at 124M, while consistently improving on downstream tasks including reducing LAMBADA perplexity by 11.7% and enhancing HellaSwag accuracy by 4.3%. Moreover, the Length-MAX tokenizer achieves 99.62% vocabulary coverage and the out-of-vocabulary rate remains low at 0.12% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing -- and often improving -- downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18% at inference.
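The core idea of rewarding long, frequent tokens can be illustrated with a crude greedy selection over substring candidates; the paper's actual method casts vocabulary construction as a graph partitioning problem with a dedicated greedy approximation, which this sketch (with its count * (length - 1) score and max_len cutoff) does not reproduce.

```python
from collections import Counter

def greedy_length_weighted_vocab(corpus, vocab_size: int, max_len: int = 8):
    """Crude greedy vocabulary selection that favors long, frequent substrings,
    in the spirit of optimizing average token length. The scoring rule and the
    max_len cutoff are simplifications, not the paper's algorithm.
    """
    counts = Counter()
    for text in corpus:
        for i in range(len(text)):
            for j in range(i + 1, min(i + 1 + max_len, len(text) + 1)):
                counts[text[i:j]] += 1

    # Always keep single characters so every string remains tokenizable.
    vocab = {c for text in corpus for c in text}

    # Rank candidates by (frequency) * (characters saved over single-char tokens).
    scored = sorted(counts.items(), key=lambda kv: kv[1] * (len(kv[0]) - 1), reverse=True)
    for token, _ in scored:
        if len(vocab) >= vocab_size:
            break
        vocab.add(token)
    return vocab
```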

Interests not found

We did not find any papers matching the interests below. Try other terms, and consider whether such content exists on arxiv.org.
  • Deep Learning Models
You can edit or add more interests any time.