Hi!

Your personalized paper recommendations for 2 to 6 February 2026.
University of California, Berkeley
AI Insights
  • The paper provides a theoretical framework for understanding diffusion models and proposes an experiment to validate it. (ML: 0.96)
  • The framework describes the behavior of diffusion models in terms of two distinct transitions: speciation and collapse. (ML: 0.81)
  • Speciation: the transition from regime I (high noise) to regime II (clustered structure). (ML: 0.76)
  • Collapse: the transition from regime II to regime III (condensation). (ML: 0.75)
  • Synchronization gap: a period where the global structure is already decided while modality-specific discrepancies remain unstable. (ML: 0.81)
  • MNIST synchronization experiment: a minimal image experiment that trains an unconditional ε-prediction diffusion model on a two-channel state, with the goal of observing mode ordering during the reverse process. (ML: 0.86)
  • The results of the experiment are expected to provide insights into the behavior of diffusion models. (ML: 0.83)
Abstract
Diffusion-based generative models have achieved unprecedented fidelity in synthesizing high-dimensional data, yet the theoretical mechanisms governing multimodal generation remain poorly understood. Here, we present a theoretical framework for coupled diffusion models, using coupled Ornstein-Uhlenbeck processes as a tractable model. Using the nonequilibrium statistical physics of dynamical phase transitions, we demonstrate that multimodal generation is governed by a spectral hierarchy of interaction timescales rather than simultaneous resolution. A key prediction is the "synchronization gap", a temporal window during the reverse generative process where distinct eigenmodes stabilize at different rates, providing a theoretical explanation for common desynchronization artifacts. We derive analytical conditions for speciation and collapse times under both symmetric and anisotropic coupling regimes, establishing strict bounds on coupling strength to avoid unstable symmetry breaking. We show that the coupling strength acts as a spectral filter that enforces a tunable temporal hierarchy on generation. We support these predictions through controlled experiments with diffusion models trained on the MNIST dataset and exact score samplers. These results motivate time-dependent coupling schedules that target mode-specific timescales, offering a potential alternative to ad hoc guidance tuning.
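The coupled Ornstein-Uhlenbeck picture in the abstract can be sketched numerically. Below is a toy two-variable simulation, not the authors' code: the parameters theta, c, and sigma are illustrative, and the split into fast and slow eigenmodes mirrors the claimed spectral hierarchy.

```python
import math, random

# Toy coupled Ornstein-Uhlenbeck pair:
#   dx1 = (-theta*x1 + c*x2) dt + sigma dW1
#   dx2 = ( c*x1 - theta*x2) dt + sigma dW2
# The eigenmodes s = (x1+x2)/sqrt(2) and d = (x1-x2)/sqrt(2) decouple with
# relaxation rates theta - c and theta + c: the coupling c splits one
# timescale into two, the "spectral filter" effect. Stability needs
# theta > |c|, echoing the abstract's bounds on coupling strength.

def simulate(theta=1.0, c=0.6, sigma=0.1, dt=1e-3, steps=5000, seed=0):
    rng = random.Random(seed)
    x1, x2 = 1.0, -1.0  # start with only the "difference" mode excited
    for _ in range(steps):
        w1 = rng.gauss(0.0, math.sqrt(dt))
        w2 = rng.gauss(0.0, math.sqrt(dt))
        x1, x2 = (x1 + (-theta * x1 + c * x2) * dt + sigma * w1,
                  x2 + (c * x1 - theta * x2) * dt + sigma * w2)
    s = (x1 + x2) / math.sqrt(2)   # slow mode, rate theta - c = 0.4
    d = (x1 - x2) / math.sqrt(2)   # fast mode, rate theta + c = 1.6
    return s, d

s, d = simulate()
# After 5 time units the initially excited fast mode has relaxed to
# noise level, while the slow mode still fluctuates on its own timescale.
print(abs(d) < 0.5)
```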
Why we are recommending this paper
Due to your Interest in Diffusion Models

This paper directly addresses diffusion models, a core interest of yours, and explores their behavior in multimodal contexts. Understanding the theoretical mechanisms of coupled diffusion models is crucial for advancing research in this area.
NEC Laboratories Europe
AI Insights
  • Logical guidance: a method for constructing posterior coefficients and logical scores using a recursive construction of the circuit of the formula. (ML: 0.98)
  • It's like having a set of rules that tell the model what to focus on when making decisions. (ML: 0.98)
  • The method uses logic to construct coefficients that help the model make better predictions. (ML: 0.97)
  • The framework is applicable to various domains, including image synthesis, natural language processing, and computer vision. (ML: 0.93)
  • The paper also provides additional theoretical results, including a taxonomy query that admits an exact logical guidance rule under its framework. (ML: 0.95)
  • Taxonomy query: a set of allowed nodes in a hierarchy, interpreted as an event that specifies constraints at different levels of the hierarchy. (ML: 0.88)
  • The paper assumes conditional independence of subformulas, which may not always hold in practice. (ML: 0.89)
  • It also requires a finite taxonomy of propositions, which may not be available in all domains. (ML: 0.97)
  • The paper builds on previous work on diffusion models and logical guidance. (ML: 0.95)
  • Diffusion model: a type of deep learning model used for image synthesis and other tasks. (ML: 0.79)
Abstract
We propose LOGDIFF (Logical Guidance for the Exact Composition of Diffusion Models), a guidance framework for diffusion models that enables principled constrained generation with complex logical expressions at inference time. We study when exact score-based guidance for complex logical formulas can be obtained from guidance signals associated with atomic properties. First, we derive an exact Boolean calculus that provides a sufficient condition for exact logical guidance. Specifically, if a formula admits a circuit representation in which conjunctions combine conditionally independent subformulas and disjunctions combine subformulas that are either conditionally independent or mutually exclusive, exact logical guidance is achievable. In this case, the guidance signal can be computed exactly from atomic scores and posterior probabilities using an efficient recursive algorithm. Moreover, we show that, for commonly encountered classes of distributions, any desired Boolean formula is compilable into such a circuit representation. Second, by combining atomic guidance scores with posterior probability estimates, we introduce a hybrid guidance approach that bridges classifier guidance and classifier-free guidance, applicable to both compositional logical guidance and standard conditional generation. We demonstrate the effectiveness of our framework on multiple image and protein structure generation tasks.
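The exact Boolean calculus described above lends itself to a short sketch. The circuit encoding and atom names below are our own illustration, not LOGDIFF's API: the recursion evaluates a formula's posterior from atomic posteriors, multiplying over conditionally independent conjuncts and summing over mutually exclusive disjuncts.

```python
# Illustrative recursion over a formula circuit, following the abstract's
# sufficient condition: AND-nodes combine conditionally independent
# subformulas; OR-nodes combine subformulas that are conditionally
# independent or mutually exclusive. Node encoding is ours, not the paper's.

def posterior(node, atom_post):
    """Posterior P(formula | x) from atomic posteriors P(atom | x)."""
    kind = node[0]
    if kind == "atom":
        return atom_post[node[1]]
    if kind == "and":                      # conditionally independent parts
        p = 1.0
        for child in node[1:]:
            p *= posterior(child, atom_post)
        return p
    if kind == "or_excl":                  # mutually exclusive parts
        return sum(posterior(child, atom_post) for child in node[1:])
    if kind == "or_indep":                 # conditionally independent parts
        q = 1.0
        for child in node[1:]:
            q *= 1.0 - posterior(child, atom_post)
        return 1.0 - q
    raise ValueError(kind)

# (cat AND striped) OR (dog AND spotted), the two branches mutually
# exclusive (an image is labelled either cat or dog):
formula = ("or_excl",
           ("and", ("atom", "cat"), ("atom", "striped")),
           ("and", ("atom", "dog"), ("atom", "spotted")))
post = posterior(formula, {"cat": 0.7, "striped": 0.5, "dog": 0.3, "spotted": 0.2})
print(post)  # 0.7*0.5 + 0.3*0.2 = 0.41
```

In the full method, the same recursion combines atomic guidance scores weighted by these posteriors; the sketch stops at the probability layer.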
Why we are recommending this paper
Due to your Interest in Diffusion Models

Given your interest in diffusion models and Mixture of Experts, this paper's focus on constrained generation with logical expressions is highly relevant. It explores a novel approach to controlling diffusion model output, aligning with your goal of understanding advanced generation techniques.
Old Dominion University
AI Insights
  • The impact of RAG can be measured in terms of improving the overall LLM pipeline (e.g., comparing LLM-only vs. LLM+RAG), but an optimization effort would also benefit from isolating where the improvements come from: better evidence, more relevant evidence, or changing the prompt to use RAG. (ML: 0.97)
  • RAG has not been used primarily to correct erroneous outputs, but rather to ground the model and simulation within a context, which usually comes from an external corpus. (ML: 0.93)
  • There is awareness that using RAG is not a binary switch: several aspects must be carefully chosen and prepared, such as chunk size and overlap, embedding model choice and dimensionality, sensitivity to k when choosing top-k, similarity metric choice, and so on. (ML: 0.92)
  • Preparing a corpus for RAG involves several steps, including chunking, de-duplication, metadata enrichment, and filtering, which can improve retrieval precision and reduce bias in generation. (ML: 0.91)
  • The use of RAG in modeling and simulation has been increasing, but its reporting and optimization vary across studies. (ML: 0.91)
  • Retrieval-Augmented Generation (RAG): a pipeline that combines the strengths of retrieval-based methods with the flexibility of generative models. (ML: 0.85)
  • top-k: the number of most relevant documents retrieved by the retriever. (ML: 0.89)
  • top-p: the probability threshold for selecting tokens from the modified distribution. (ML: 0.92)
  • LLM: Large Language Model. (ML: 0.89)
  • BM25: a keyword-based retrieval method that uses a combination of term frequency and inverse document frequency to rank documents. (ML: 0.89)
Abstract
Large language models (LLMs) have rapidly become familiar tools to researchers and practitioners. Concepts such as prompting, temperature, or few-shot examples are now widely recognized, and LLMs are increasingly used in Modeling & Simulation (M&S) workflows. However, practices that appear straightforward may introduce subtle issues, unnecessary complexity, or may even lead to inferior results. Adding more data can backfire (e.g., deteriorating performance through model collapse or inadvertently wiping out existing guardrails), spending time on fine-tuning a model can be unnecessary without a prior assessment of what it already knows, setting the temperature to 0 is not sufficient to make LLMs deterministic, providing a large volume of M&S data as input can be excessive (LLMs cannot attend to everything) but naive simplifications can lose information. We aim to provide comprehensive and practical guidance on how to use LLMs, with an emphasis on M&S applications. We discuss common sources of confusion, including non-determinism, knowledge augmentation (including RAG and LoRA), decomposition of M&S data, and hyper-parameter settings. We emphasize principled design choices, diagnostic strategies, and empirical evaluation, with the goal of helping modelers make informed decisions about when, how, and whether to rely on LLMs.
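Several of the knobs the survey discusses (top-k retrieval, BM25, chunking) are concrete enough to sketch. Below is a minimal pure-Python BM25 ranker over a toy corpus; the documents and parameter values are illustrative, not drawn from the paper.

```python
import math

# Minimal BM25 ranker, sketching the keyword-based retrieval the paper's
# glossary describes. k1 and b are the usual BM25 free parameters at
# commonly used default-ish values, not values from the paper.

def bm25_scores(query, docs, k1=1.5, b=0.75):
    docs_tok = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in docs_tok) / len(docs_tok)
    n = len(docs_tok)
    scores = []
    for toks in docs_tok:
        s = 0.0
        for term in query.lower().split():
            df = sum(term in d for d in docs_tok)       # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            tf = toks.count(term)                        # term frequency
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

docs = ["agent based simulation of traffic",
        "retrieval augmented generation for simulation models",
        "gradient descent convergence proofs"]
scores = bm25_scores("retrieval for simulation", docs)
top1 = max(range(len(docs)), key=lambda i: scores[i])
print(top1)  # 1: the middle doc matches both "retrieval" and "simulation"
```

A real RAG pipeline would first chunk and de-duplicate the corpus, then take the top-k of these scores as context, which is exactly where the survey's warnings about chunk size and k sensitivity apply.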
Why we are recommending this paper
Due to your Interest in Large Language Models

This paper's exploration of Large Language Models within Modeling & Simulation directly addresses a significant area of interest for you. It provides a valuable overview of LLM applications in this domain, offering insights into current practices and challenges.
The University of Warwick
AI Insights
  • The planning mechanism is often accompanied by other behaviors such as self-improvement, self-evaluation, and self-prompting. (ML: 0.98)
  • The ability to plan ahead is not limited to specific tasks or domains, and can be observed in various language models trained on different datasets. (ML: 0.97)
  • The findings of this study highlight the importance of considering the emergent properties of complex systems, rather than just their individual components. (ML: 0.96)
  • Planning Ahead: the ability of a language model to anticipate and prepare for future tokens or events in a conversation or task. (ML: 0.96)
  • The study of large language models' planning abilities has significant implications for the development of more sophisticated and human-like AI systems. (ML: 0.95)
  • Large Language Models (LLMs): a type of artificial intelligence model that uses natural language processing to generate human-like text. (ML: 0.95)
  • Large language models are capable of planning ahead for future tokens. (ML: 0.94)
  • The planning mechanism is not a fixed property of the model, but rather an emergent behavior that arises from the interactions between different components of the model. (ML: 0.92)
  • Further research is needed to fully understand the mechanisms underlying this behavior and to explore its potential applications in various domains. (ML: 0.79)
  • Emergent Behavior: a property or behavior that arises from the interactions between different components of a system, rather than being explicitly programmed. (ML: 0.74)
Abstract
Large language models (LLMs) have been shown to acquire sequence-level planning abilities during training, yet their planning behavior exhibited at inference time often appears short-sighted and inconsistent with these capabilities. We propose a Bayesian account for this gap by grounding planning behavior in the evolving generative context: given the subtle differences between natural language and the language internalized by LLMs, accumulated self-generated context drives a planning-shift during inference and thereby creates the appearance of compromised planning behavior. We further validate the proposed model through two controlled experiments: a random-generation task demonstrating constrained planning under human prompts and increasing planning strength as self-generated context accumulates, and a Gaussian-sampling task showing reduced initial bias when conditioning on self-generated sequences. These findings provide a theoretical explanation along with empirical evidence for characterizing how LLMs plan ahead during inference.
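The Bayesian account above can be caricatured with a conjugate-Gaussian update, in the spirit of the paper's Gaussian-sampling task. All numbers below are invented for illustration: the posterior mean drifts from the prompt-induced prior toward the statistics of accumulated self-generated context.

```python
# Toy conjugate-Gaussian sketch of the abstract's Bayesian account: the
# model's effective "plan" (the posterior mean) starts at a prior shaped
# by the human prompt and shifts toward its own generated context as that
# context accumulates. Illustrative numbers, not the paper's experiment.

def posterior_mean(prior_mu, prior_var, obs, obs_var=1.0):
    mu, var = prior_mu, prior_var
    for x in obs:                     # standard Gaussian conjugate update
        new_var = 1.0 / (1.0 / var + 1.0 / obs_var)
        mu = new_var * (mu / var + x / obs_var)
        var = new_var
    return mu

prompt_bias = 0.0                # prior mean induced by the human prompt
self_generated = [2.0] * 10      # accumulated self-generated samples near 2
print(posterior_mean(prompt_bias, 1.0, self_generated[:1]))  # 1.0
print(posterior_mean(prompt_bias, 1.0, self_generated))      # 20/11 ≈ 1.82
```

With one self-generated sample the estimate sits halfway between prompt and context; after ten, the context dominates, which is the "planning-shift" intuition in miniature.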
Why we are recommending this paper
Due to your Interest in Large Language Models

The paper investigates the planning capabilities of LLMs, a key area of interest for you. Understanding the limitations of LLM planning at inference time is critical for developing robust and reliable systems.
Southeast University
AI Insights
  • The paper presents two innovative frameworks, Multi-Expert LDL and Pattern-Aware LDL-MoE, that advance probabilistic time series forecasting by unifying accurate prediction with interpretable uncertainty quantification. (ML: 0.91)
  • Multi-Expert LDL: a framework that uses continuous distribution modeling to achieve state-of-the-art performance in probabilistic time series forecasting. (ML: 0.90)
  • The Multi-Expert LDL framework demonstrates the superiority of continuous distribution modeling, achieving state-of-the-art performance (RMSE: 3.311, MAE: 2.919) through specialized LSTM experts that capture diverse uncertainty patterns. (ML: 0.91)
  • Pattern-Aware LDL-MoE: an extension of the Multi-Expert LDL framework that explicitly decomposes forecasts into interpretable temporal components (trend, seasonality, changepoints, and volatility). (ML: 0.89)
  • The Pattern-Aware variant enables practitioners to both predict outcomes and understand their underlying drivers. (ML: 0.90)
  • The success of these approaches stems from their ability to automatically adapt to different temporal regimes while maintaining computational efficiency through careful architectural design. (ML: 0.89)
  • Current limitations in computational overhead and sequence length handling point to valuable future research directions. (ML: 0.87)
  • The paper cites several relevant studies on time series forecasting, including [1] Yoshua Bengio et al. (2013) and [2] Rahul Dey and Fathi M Salem (2017). (ML: 0.88)
Abstract
Time series forecasting in real-world applications requires both high predictive accuracy and interpretable uncertainty quantification. Traditional point prediction methods often fail to capture the inherent uncertainty in time series data, while existing probabilistic approaches struggle to balance computational efficiency with interpretability. We propose a novel Multi-Expert Learning Distributional Labels (LDL) framework that addresses these challenges through mixture-of-experts architectures with distributional learning capabilities. Our approach introduces two complementary methods: (1) Multi-Expert LDL, which employs multiple experts with different learned parameters to capture diverse temporal patterns, and (2) Pattern-Aware LDL-MoE, which explicitly decomposes time series into interpretable components (trend, seasonality, changepoints, volatility) through specialized sub-experts. Both frameworks extend traditional point prediction to distributional learning, enabling rich uncertainty quantification through Maximum Mean Discrepancy (MMD). We evaluate our methods on aggregated sales data derived from the M5 dataset, demonstrating superior performance compared to baseline approaches. The continuous Multi-Expert LDL achieves the best overall performance, while the Pattern-Aware LDL-MoE provides enhanced interpretability through component-wise analysis. Our frameworks successfully balance predictive accuracy with interpretability, making them suitable for real-world forecasting applications where both performance and actionable insights are crucial.
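Since the abstract leans on Maximum Mean Discrepancy (MMD) as its distributional training signal, a sketch of the standard (biased) MMD² estimator with an RBF kernel may help. The bandwidth and samples below are illustrative, and this is not the authors' implementation.

```python
import math

# Squared MMD between two 1-D samples with an RBF kernel: the standard
# (biased) V-statistic estimator. The abstract uses MMD to compare a
# predicted distribution against observed labels; here we just compare
# two toy samples. gamma is an illustrative bandwidth choice.

def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(xs, ys, gamma=1.0):
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy   # squared norm of mean-embedding difference

same = mmd2([0.0, 0.1, -0.1], [0.05, -0.05, 0.0])
far  = mmd2([0.0, 0.1, -0.1], [3.0, 3.1, 2.9])
print(same < far)  # True: matching samples give a much smaller MMD
```

Minimizing such a discrepancy between predicted and empirical samples is what lets the LDL frameworks learn full distributions rather than point forecasts.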
Why we are recommending this paper
Due to your Interest in Mixture of Experts

This paper's focus on multi-expert systems and probabilistic time series forecasting aligns with your interest in Mixture of Experts and Deep Learning Models. The emphasis on uncertainty quantification is particularly relevant for robust forecasting applications.
Indian Institute of Technology Delhi
AI Insights
  • Expert selection decisions are influenced by internal computational dependence rather than usage frequency. (ML: 0.99)
  • Gradient-based attribution: a method used to measure the influence of an expert's internal representation on selection decisions. (ML: 0.99)
  • Training induces a transition from exploratory to confident routing, with decreasing collaboration entropy and growing successor centralization. (ML: 0.97)
  • Intrinsic expert importance is measured via gradient-based attribution, while relational importance is measured via routing mass. (ML: 0.97)
  • The orchestrator's behavior is influenced by both intrinsic expert importance and relational importance. (ML: 0.94)
  • Routing mass: the total incoming routing mass derived from the conditional interaction matrix, reflecting an expert's structural position within the collaboration graph. (ML: 0.91)
  • Orchestrator: a system that manages the collaboration between multiple experts to achieve a common goal. (ML: 0.88)
  • Expert: an individual contributor within the orchestration framework, responsible for providing specific knowledge or skills. (ML: 0.85)
  • The study highlights the importance of understanding the behavior of orchestration frameworks and their underlying mechanisms. (ML: 0.79)
  • Prompt perturbation analysis serves as a causal probe of orchestrator behavior. (ML: 0.71)
Abstract
Multi-expert systems, where multiple Large Language Models (LLMs) collaborate to solve complex tasks, are increasingly adopted for high-performance reasoning and generation. However, the orchestration policies governing expert interaction and sequencing remain largely opaque. We introduce INFORM, an interpretability analysis that treats orchestration as an explicit, analyzable computation, enabling the decoupling of expert interaction structure, execution order, and causal attribution. We use INFORM to evaluate an orchestrator on GSM8K, HumanEval, and MMLU using a homogeneous consortium of ten instruction-tuned experts drawn from LLaMA-3.1 8B, Qwen-3 8B, and DeepSeek-R1 8B, with controlled decoding-temperature variation, and a secondary heterogeneous consortium spanning 1B-7B parameter models. Across tasks, routing dominance is a poor proxy for functional necessity. We reveal a divergence between relational importance, captured by routing mass and interaction topology, and intrinsic importance, measured via gradient-based causal attribution: frequently selected experts often act as interaction hubs with limited causal influence, while sparsely routed experts can be structurally critical. Orchestration behaviors emerge asynchronously, with expert centralization preceding stable routing confidence and expert ordering remaining non-deterministic. Targeted ablations show that masking intrinsically important experts induces disproportionate collapse in interaction structure compared to masking frequent peers, confirming that INFORM exposes causal and structural dependencies beyond accuracy metrics alone.
Why we are recommending this paper
Due to your Interest in Mixture of Experts
Luxembourg Institute of Science and Technology
AI Insights
  • The paper proposes a method for generating neural network architectures using large language models (LLMs). (ML: 0.92)
  • The proposed method combines two techniques: instruction-guided autoregressive neural network parameter generation and tabular data generation using agentic LLM methods. (ML: 0.85)
  • This can help improve the performance of various machine learning tasks such as image classification, object detection, and natural language processing. (ML: 0.97)
  • The authors demonstrate the effectiveness of their approach by generating neural network architectures that outperform state-of-the-art models in several tasks. (ML: 0.92)
  • The authors cite several papers that demonstrate the effectiveness of using LLMs for generating neural network architectures. (ML: 0.92)
  • The authors also discuss the limitations and challenges associated with this approach. (ML: 0.98)
  • The method relies heavily on the capabilities of LLMs, which may not be available to all researchers or practitioners. (ML: 0.97)
  • Additionally, the generated neural network architectures may not always outperform state-of-the-art models in various tasks. (ML: 0.98)
  • LLM: Large Language Model. (ML: 0.89)
Abstract
Neural networks are increasingly used to support decision-making. To verify their reliability and adaptability, researchers and practitioners have proposed a variety of tools and methods for tasks such as NN code verification, refactoring, and migration. These tools play a crucial role in guaranteeing both the correctness and maintainability of neural network architectures, helping to prevent implementation errors, simplify model updates, and ensure that complex networks can be reliably extended and reused. Yet, assessing their effectiveness remains challenging due to the lack of publicly available, diverse datasets of neural networks that would allow systematic evaluation. To address this gap, we leverage large language models (LLMs) to automatically generate a dataset of neural networks that can serve as a benchmark for validation. The dataset is designed to cover diverse architectural components and to handle multiple input data types and tasks. In total, 608 samples are generated, each conforming to a set of precise design choices. To further ensure their consistency, we validate the correctness of the generated networks using static analysis and symbolic tracing. We make the dataset publicly available to support the community in advancing research on neural network reliability and adaptability.
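The abstract's validation step (static analysis plus symbolic tracing) can be approximated in miniature. The check below is a stdlib `ast` stand-in for the static-analysis half, with our own acceptance criterion (the source parses and defines a class with a `forward` method); it is not the authors' actual tooling.

```python
import ast

# Sketch of a static-analysis gate for LLM-generated network code: reject
# anything that fails to parse or lacks a class with a forward method.
# A minimal stand-in for the paper's validation pipeline, with an
# acceptance criterion of our own choosing.

def looks_like_nn_module(source: str) -> bool:
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            if any(isinstance(f, ast.FunctionDef) and f.name == "forward"
                   for f in node.body):
                return True
    return False

good = """
class TinyNet:
    def __init__(self):
        self.w = 0.5
    def forward(self, x):
        return self.w * x
"""
bad = "class Broken:\n    def forward(self x):\n        return x\n"  # syntax error

print(looks_like_nn_module(good), looks_like_nn_module(bad))  # True False
```

The paper's actual pipeline goes further (symbolic tracing exercises the computation graph rather than just the syntax), but the same accept/reject shape applies.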
Why we are recommending this paper
Due to your Interest in Deep Learning Models
University of Southern California
AI Insights
  • The study highlights the value of exact, interpretable subset selection objectives as cognitive models, with an emphasis on interpretability and theoretical transparency. (ML: 0.99)
  • Representativeness: the ability of an exemplar to summarize the distribution of the dataset. (ML: 0.99)
  • Hybrid objectives that combine representativeness and diversity further improve alignment with human judgments, supporting the idea that human teachers may implicitly trade off multiple pedagogical goals. (ML: 0.99)
  • The study relies on a specific dataset and may not generalize to other domains or populations. (ML: 0.99)
  • Selection strategies based on joint representativeness provide a closer match to human behavior on both prototypicality and diversity measures than strategies that prioritize either individual prototypicality or mutual dissimilarity alone. (ML: 0.98)
  • Future work might model individual learners and test how strategies shift based on learning contexts, as well as explore connections to recent advances in synthetic dataset generation in machine learning. (ML: 0.98)
  • Human teaching behavior is best characterized by a structured balance between representativeness and diversity that shifts with resource constraints. (ML: 0.98)
  • Pedagogical goals: the objectives that a teacher aims to achieve when selecting exemplars for teaching, such as conveying category structure or promoting learning. (ML: 0.97)
  • Transformer-based representations outperform convolutional ones in predicting human behavior, suggesting that global self-attention may better capture the similarity relations humans use to evaluate exemplars on a continuum. (ML: 0.97)
  • Diversity: the ability of an exemplar to convey new information not already present in other selected exemplars. (ML: 0.96)
Abstract
Teaching requires distilling a rich category distribution into a small set of informative exemplars. Although prior work shows that humans consider both representativeness and diversity when teaching, the computational principles underlying these tradeoffs remain unclear. We address this gap by modeling human exemplar selection using neural network feature representations and principled subset selection criteria. Novel visual categories were embedded along a one-dimensional morph continuum using pretrained vision models, and selection strategies varied in their emphasis on prototypicality, joint representativeness, and diversity. Adult participants selected one to three exemplars to teach a learner. Model-human comparisons revealed that strategies based on joint representativeness, or its combination with diversity, best captured human judgments, whereas purely prototypical or diversity-based strategies performed worse. Moreover, transformer-based representations consistently aligned more closely with human behavior than convolutional networks. These results highlight the potential utility of dataset distillation methods in machine learning as computational models for teaching.
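The representativeness/diversity trade-off studied here admits a small sketch. The greedy objective below is our own illustrative stand-in, not the paper's exact criterion: representativeness as (negative) mean distance from each item to its nearest chosen exemplar, diversity as the minimum pairwise gap within the subset, traded off by a weight `lam`.

```python
# Greedy exemplar selection on a 1-D morph continuum, illustrating a
# representativeness + diversity objective of the kind the abstract
# studies. Objective and weight are illustrative, not the paper's.

def score(subset, items, lam=1.0):
    # representativeness: how well the subset covers all items
    rep = -sum(min(abs(x - s) for s in subset) for x in items) / len(items)
    # diversity: smallest pairwise gap among chosen exemplars
    div = min((abs(a - b) for a in subset for b in subset if a != b),
              default=0.0)
    return rep + lam * div

def greedy_select(items, k, lam=1.0):
    chosen = []
    for _ in range(k):
        best = max((x for x in items if x not in chosen),
                   key=lambda x: score(chosen + [x], items, lam))
        chosen.append(best)
    return sorted(chosen)

continuum = [i / 10 for i in range(11)]   # morph positions 0.0 .. 1.0
print(greedy_select(continuum, 3))        # [0.0, 0.5, 1.0]
```

With this hybrid objective the three exemplars spread over the continuum (center plus both extremes), whereas a purely prototypical objective (`lam=0` with a single pick) would cluster near the median, which mirrors the paper's contrast between strategy families.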
Why we are recommending this paper
Due to your Interest in Deep Learning Models
Beijing University of Posts and Telecommunications
AI Insights
  • The paper proposes a new adaptive optimization algorithm, AdamO, built on fully decoupled orthogonal dynamics with curvature-adaptive radial step sizing and architecture-aware updates/projections. (ML: 0.61)
  • Orthogonal dynamics: a method for optimizing neural networks by decoupling the update rules for different dimensions. (ML: 0.81)
  • Curvature-adaptive radial step sizing: an adaptive learning-rate scheme that adjusts the step size based on the curvature of the loss function. (ML: 0.89)
  • Scale-invariant components: components in neural networks that are invariant to scaling, such as BatchNorm layers. (ML: 0.92)
  • Architecture-aware updates/projections: updates and projections that account for the architecture of the neural network, such as BatchNorm layers. (ML: 0.92)
  • AdamO is designed to handle scale-invariant components in neural networks, such as BatchNorm, by using projections to suppress ineffective updates. (ML: 0.81)
  • The paper evaluates AdamO on CIFAR-100 and modular-arithmetic Grokking tasks, showing that it outperforms other optimization algorithms, including AdamW and AdamP. (ML: 0.84)
  • AdamO's performance is robust across a wide range of hyperparameters, making it easier to tune and use in practice. (ML: 0.67)
  • AdamO is a robust and effective optimization algorithm for deep learning tasks, particularly those involving scale-invariant components. (ML: 0.71)
Abstract
Is the standard weight decay in AdamW truly optimal? Although AdamW decouples weight decay from adaptive gradient scaling, a fundamental conflict remains: the Radial Tug-of-War. In deep learning, gradients tend to increase parameter norms to expand effective capacity while steering directions to learn features, whereas weight decay indiscriminately suppresses norm growth. This push-pull interaction induces radial oscillations, injecting noise into Adam's second-moment estimates and potentially degrading delicate tangential feature learning. We argue that magnitude and direction play distinct roles and should be decoupled in optimizer dynamics. We propose Orthogonal Dynamics Decoupling and instantiate it as AdamO: an SGD-style update handles the one-dimensional norm control, while Adam's adaptive preconditioning is confined to the tangential subspace. AdamO further incorporates curvature-adaptive radial step sizing and architecture-aware rules and projections for scale-invariant layers and low-dimensional parameters. Experiments on vision and language tasks show that AdamO improves generalization and stability over AdamW without introducing additional complex constraints.
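The core move of the abstract, splitting an update into radial and tangential parts, is easy to sketch. The function below performs only that orthogonal decomposition for a single weight vector; it omits Adam's moment estimates and the curvature-adaptive radial step, so it illustrates the decomposition rather than AdamO itself.

```python
# Orthogonal split of a gradient with respect to a weight vector w:
# a radial part (along w, controlling the norm) and a tangential part
# (orthogonal to w, steering the direction). In the abstract's scheme
# the radial part would get an SGD-style norm update and the tangential
# part Adam's adaptive preconditioning; neither is implemented here.

def split_radial_tangential(w, g):
    wn2 = sum(x * x for x in w)                         # |w|^2
    coef = sum(a * b for a, b in zip(w, g)) / wn2       # (g.w) / |w|^2
    radial = [coef * x for x in w]                      # projection onto w
    tangential = [a - b for a, b in zip(g, radial)]     # remainder
    return radial, tangential

w = [3.0, 4.0]
g = [1.0, 2.0]
radial, tangential = split_radial_tangential(w, g)
dot = sum(a * b for a, b in zip(w, tangential))
print(abs(dot) < 1e-12)  # True: the tangential part is orthogonal to w
```

Because the two parts are orthogonal, norm growth and feature-direction learning stop interfering in the second-moment statistics, which is the "tug-of-war" the paper aims to remove.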
Why we are recommending this paper
Due to your Interest in Deep Learning Optimization
Peking University
AI Insights
  • Imagine you're trying to learn a new language: you start by learning basic phrases and grammar rules from a teacher (pretraining), then you practice speaking with native speakers (finetuning) to improve your skills. (ML: 0.98)
  • The paper presents a theoretical analysis of the pretrain-finetune paradigm for Large Language Models (LLMs), providing insights into when pretraining is critical and when finetuning alone can be sufficient. (ML: 0.96)
  • The pretrain-finetune paradigm can be effective in LLM-style pipelines due to its ability to leverage abundant data, learn transferable structure, and adapt to new tasks with limited data. (ML: 0.91)
  • The paper presents the first comprehensive theoretical analysis of Transformer learning in relevant settings, providing a principled account of why pretrain-finetune paradigms can be effective in LLM-style pipelines. (ML: 0.91)
  • The paper builds on previous work on LLMs and pretrain-finetune paradigms. (ML: 0.93)
  • The paper assumes that the prior distribution G is known, which may not always be the case in real-world applications. (ML: 0.92)
  • LLM (Large Language Model): a type of neural network model that is trained on a large corpus of text data and can generate human-like language. (ML: 0.95)
  • LSTM (Long Short-Term Memory): a type of recurrent neural network (RNN) designed to handle long-term dependencies in data. (ML: 0.93)
Abstract
We consider small-data, large-scale decision problems in which a firm must make many operational decisions simultaneously (e.g., across a large product portfolio) while observing only a few, potentially noisy, data points per instance. Inspired by the success of large language models (LLMs), we propose a pretrain-then-finetune approach built on a designed Transformer model to address this challenge. The model is first pretrained on large-scale, domain-informed synthetic data that encode managerial knowledge and structural features of the decision environment, and is then fine-tuned on real observations. This new pipeline offers two complementary advantages: pretraining injects domain knowledge into the learning process and enables the training of high-capacity models using abundant synthetic data, while finetuning adapts the pretrained model to the operational environment and improves alignment with the true data-generating regime. While we have leveraged the Transformer's state-of-the-art representational capacity, particularly its attention mechanism, to efficiently extract cross-task structure, our approach is not an off-the-shelf application. Instead, it relies on problem-specific architectural design and a tailored training procedure to match the decision setting. Theoretically, we develop the first comprehensive error analysis regarding Transformer learning in relevant contexts, establishing nonasymptotic guarantees that validate the method's effectiveness. Critically, our analysis reveals how pretraining and fine-tuning jointly determine performance, with the dominant contribution governed by whichever is more favorable. In particular, finetuning exhibits an economies-of-scale effect, whereby transfer learning becomes increasingly effective as the number of instances grows.
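The pretrain-then-finetune pipeline the abstract outlines can be miniaturized to a toy scalar problem. A minimal sketch, assuming a one-parameter linear model: synthetic data encode a managerial prior slope, a handful of noisy real observations come from the true regime, and everything here (the `sgd_fit` helper, learning rates, data sizes) is illustrative, not the paper's Transformer setup.

```python
import random

def sgd_fit(w, data, lr, epochs):
    """Plain least-squares SGD for the scalar model y ≈ w * x."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2.0 * (w * x - y) * x
    return w

random.seed(0)
true_w = 1.5    # the unknown operational regime (observed only noisily)
prior_w = 1.2   # managerial/domain knowledge used to synthesize data

# Stage 1: pretrain on abundant synthetic data encoding the prior.
synthetic = [(x, prior_w * x) for x in (random.uniform(-1, 1) for _ in range(500))]
w0 = sgd_fit(0.0, synthetic, lr=0.1, epochs=3)

# Stage 2: finetune on a handful of real, noisy observations.
real = [(x, true_w * x + random.gauss(0, 0.05))
        for x in [random.uniform(-1, 1) for _ in range(5)]]
w1 = sgd_fit(w0, real, lr=0.3, epochs=50)
# w0 sits near the prior slope; finetuning moves w1 toward the true regime.
```

The point of the toy: pretraining gets the model close for free, and even five real points suffice to close most of the remaining gap, mirroring the abstract's claim that pretraining and finetuning jointly determine performance.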
Why are we recommending this paper?
Due to your interest in Deep Learning Optimization
University of Maryland, College Park
AI Insights
  • The dataset used in the study may not be representative of real-world scenarios, as it is generated using a benchmark. (ML: 0.97)
  • The expert agent employed in the study has full access to the traffic scene and generates data trails with variations in scenario configurations, vehicles, initial positions, and trajectories. (ML: 0.95)
  • The use of expert agents to generate data trails with variations in scenario configurations, vehicles, initial positions, and trajectories is a key aspect of the study. (ML: 0.93)
  • Multi-Modal Collaborative Decision-Making: A framework that combines sensor data from multiple vehicles to make informed decisions. (ML: 0.90)
  • The proposed multi-modal collaborative decision-making framework can improve traffic safety and efficiency by combining sensor data from multiple vehicles. (ML: 0.90)
  • The paper presents a multi-modal collaborative decision-making framework for connected autonomous vehicles (CAVs) to improve traffic safety and efficiency. (ML: 0.90)
  • The dataset used in the study is generated using the AUTOCASTSIM benchmark, which features three complex and accident-prone traffic scenarios for CAVs. (ML: 0.84)
  • The proposed framework combines sensor data from multiple vehicles, including cameras and LiDAR sensors, to make informed decisions. (ML: 0.81)
  • AUTOCASTSIM Benchmark: A simulation benchmark featuring three complex and accident-prone traffic scenarios for CAVs. (ML: 0.71)
  • Connected Autonomous Vehicles (CAVs): Vehicles equipped with advanced sensors and communication systems that enable them to operate autonomously. (ML: 0.59)
Abstract
Multi-agent systems are increasingly equipped with heterogeneous multimodal sensors, enabling richer perception but introducing modality-specific and agent-dependent uncertainty. Existing multi-agent collaboration frameworks typically reason at the agent level, assume homogeneous sensing, and handle uncertainty implicitly, limiting robustness under sensor corruption. We propose Active Asymmetric Multi-Agent Multimodal Learning under Uncertainty (A2MAML), a principled approach for uncertainty-aware, modality-level collaboration. A2MAML models each modality-specific feature as a stochastic estimate with uncertainty prediction, actively selects reliable agent-modality pairs, and aggregates information via Bayesian inverse-variance weighting. This formulation enables fine-grained, modality-level fusion, supports asymmetric modality availability, and provides a principled mechanism to suppress corrupted or noisy modalities. Extensive experiments on connected autonomous driving scenarios for collaborative accident detection demonstrate that A2MAML consistently outperforms both single-agent and collaborative baselines, achieving up to 18.7% higher accident detection rate.
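The Bayesian inverse-variance weighting at the core of the fusion step is standard precision-weighted averaging; the sketch below adds a reliability gate of the kind the abstract describes for suppressing corrupted modalities. The `fuse` name and the `max_var` threshold are illustrative assumptions, not A2MAML's API.

```python
def fuse(estimates, variances, max_var=1.0):
    """Precision-weighted (inverse-variance) fusion of per-modality estimates,
    after gating out agent-modality streams whose predicted variance is too high."""
    kept = [(m, v) for m, v in zip(estimates, variances) if v <= max_var]
    precisions = [1.0 / v for _, v in kept]
    total = sum(precisions)
    fused = sum(p * m for (m, _), p in zip(kept, precisions)) / total
    return fused, 1.0 / total  # fused estimate and its (reduced) variance
```

For example, `fuse([1.0, 3.0], [0.1, 0.4])` weights the low-variance stream four times more heavily than the noisier one and returns a fused variance (0.08) below either input, which is why pooling reliable modalities helps while the gate keeps corrupted ones from dominating.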
Why are we recommending this paper?
Due to your interest in Multimodal Learning
Genentech
AI Insights
  • The authors emphasize the importance of establishing consensus on evaluation priorities, such as whether sample matching or downstream tasks should serve as the primary performance criterion. (ML: 0.99)
  • The evaluation framework used in this paper may not be realistic or representative of real-world scenarios. (ML: 0.98)
  • The authors evaluate their method on several datasets and show that it outperforms existing methods in terms of sample matching and downstream tasks such as imputation. (ML: 0.98)
  • Contrastive learning: A type of self-supervised learning that learns to distinguish between similar and dissimilar examples in the input data. (ML: 0.94)
  • The proposed method uses a combination of contrastive learning and self-supervised learning to learn representations that are invariant to variations in the input data. (ML: 0.94)
  • Self-supervised learning: A type of unsupervised learning where the model is trained on its own predictions rather than relying on labeled data. (ML: 0.93)
  • Weakly paired multimodal data: Data where different modalities (e.g., RNA sequencing, imaging) are not perfectly aligned or synchronized. (ML: 0.90)
  • The proposed method shows promising results for weakly paired multimodal data, but further research is needed to fully understand its limitations and potential applications. (ML: 0.89)
  • The evaluation framework used in this paper highlights the need for more realistic simulation frameworks that can better guide future model development. (ML: 0.89)
  • The paper presents a method for weakly paired multimodal data, which is a common challenge in single-cell RNA sequencing and other fields. (ML: 0.76)
Abstract
We present GROOVE, a semi-supervised multi-modal representation learning approach for high-content perturbation data where samples across modalities are weakly paired through shared perturbation labels but lack direct correspondence. Our primary contribution is GroupCLIP, a novel group-level contrastive loss that bridges the gap between CLIP for paired cross-modal data and SupCon for uni-modal supervised contrastive learning, addressing a fundamental gap in contrastive learning for weakly-paired settings. We integrate GroupCLIP with an on-the-fly backtranslating autoencoder framework to encourage cross-modally entangled representations while maintaining group-level coherence within a shared latent space. Critically, we introduce a comprehensive combinatorial evaluation framework that systematically assesses representation learners across multiple optimal transport aligners, addressing key limitations in existing evaluation strategies. This framework includes novel simulations that systematically vary shared versus modality-specific perturbation effects enabling principled assessment of method robustness. Our combinatorial benchmarking reveals that there is not yet an aligner that uniformly dominates across settings or modality pairs. Across simulations and two real single-cell genetic perturbation datasets, GROOVE performs on par with or outperforms existing approaches for downstream cross-modal matching and imputation tasks. Our ablation studies demonstrate that GroupCLIP is the key component driving performance gains. These results highlight the importance of leveraging group-level constraints for effective multi-modal representation learning in scenarios where only weak pairing is available.
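The idea behind GroupCLIP, positives defined at the group (shared perturbation label) level across modalities, can be sketched as a SupCon-style cross-modal loss. This is a hedged reconstruction from the abstract only: the paper's exact loss, normalization, and symmetrization may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(x * x for x in v)))

def group_contrastive_loss(za, zb, labels_a, labels_b, tau=0.1):
    """Group-level contrastive loss across modalities A and B: every B-sample
    sharing an A-anchor's perturbation label is a positive. (CLIP would allow
    exactly one paired positive; SupCon would stay within a single modality.)"""
    total, anchors = 0.0, 0
    for i, u in enumerate(za):
        logits = [cosine(u, v) / tau for v in zb]
        log_denom = math.log(sum(math.exp(x) for x in logits))
        positives = [j for j, lab in enumerate(labels_b) if lab == labels_a[i]]
        if not positives:
            continue  # anchor has no cross-modal group mate
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors
```

With orthogonal per-group embeddings the loss is near zero when labels agree across modalities and large when they are shuffled, which is the behavior a group-level alignment objective should exhibit in the weakly paired setting.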
Why are we recommending this paper?
Due to your interest in Multimodal Learning
Aristotle University of Thessaloniki
AI Insights
  • Convolutional Neural Network (CNN): A type of neural network that uses convolutional layers to extract features from input data. (ML: 0.94)
  • Parameterizable: The ability to define and modify parameters of a design, allowing for the creation of multiple variants from a single template. (ML: 0.86)
  • Embedded Deep Learning: The application of deep learning techniques in embedded systems, such as mobile devices or IoT sensors. (ML: 0.82)
  • Future work includes design modeling and multiobjective optimization techniques to automate finding a suitable design point for the embedded deep learning application. (ML: 0.81)
  • FPGA (Field-Programmable Gate Array): An integrated circuit that can be programmed after manufacturing, allowing for the creation of custom digital circuits. (ML: 0.77)
  • High-Level Synthesis (HLS): A set of tools that allow designers to describe digital circuits at a high level of abstraction, using programming languages such as C or C++. (ML: 0.72)
  • The evaluation demonstrated the accelerator template's flexibility to describe design configurations with a broad range of resource, power, and latency combinations. (ML: 0.72)
  • The design of a parameterizable convolutional neural network (CNN) FPGA accelerator architecture using high-level synthesis (HLS) tools is presented. (ML: 0.71)
  • Quantization at lower bit widths is key to achieving reduced resource usage, reduced power consumption, and increased performance in terms of latency. (ML: 0.70)
  • The work compares against related VGG-16 accelerators, showing that the proposed architecture outperforms Angel-Eye [6], [13] and fpgaConvNet [13] while using fewer resources and reporting lower power consumption. (ML: 0.63)
Abstract
Convolutional neural network (CNN) accelerators implemented on Field-Programmable Gate Arrays (FPGAs) are typically designed with a primary focus on maximizing performance, often measured in giga-operations per second (GOPS). However, real-life embedded deep learning (DL) applications impose multiple constraints related to latency, power consumption, area, and cost. This work presents a hardware-software (HW/SW) co-design methodology in which a CNN accelerator is described using high-level synthesis (HLS) tools that ease the parameterization of the design, facilitating more effective optimizations across multiple design constraints. Our experimental results demonstrate that the proposed design methodology is able to outperform non-parameterized design approaches, and it can be easily extended to other types of DL applications.
Why are we recommending this paper?
Due to your interest in Deep Learning Architectures

Interests not found

We did not find any papers matching the interests below. Try other terms, and consider whether such content exists on arxiv.org.
  • Deep Learning
You can edit or add more interests any time.