Hi!

Your personalized paper recommendations for 12 to 16 January, 2026.

Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance

University of Waterloo

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The method requires careful tuning of hyperparameters, including the choice of kernel and bandwidth. [3]
The paper presents a novel approach called MMD Guidance for improving the quality of generated samples from diffusion models by leveraging the Maximum Mean Discrepancy (MMD) metric. [2]
The paper provides theoretical guarantees for the convergence of the MMD Guidance method and presents several numerical experiments demonstrating its effectiveness. [1]

Abstract
Pre-trained diffusion models have emerged as powerful generative priors for both unconditional and conditional sample generation, yet their outputs often deviate from the characteristics of user-specific target data. Such mismatches are especially problematic in domain adaptation tasks, where only a few reference examples are available and retraining the diffusion model is infeasible. Existing inference-time guidance methods can adjust sampling trajectories, but they typically optimize surrogate objectives such as classifier likelihoods rather than directly aligning with the target distribution. We propose MMD Guidance, a training-free mechanism that augments the reverse diffusion process with gradients of the Maximum Mean Discrepancy (MMD) between generated samples and a reference dataset. MMD provides reliable distributional estimates from limited data, exhibits low variance in practice, and is efficiently differentiable, which makes it particularly well-suited for the guidance task. Our framework naturally extends to prompt-aware adaptation in conditional generation models via product kernels. Also, it can be applied with computational efficiency in latent diffusion models (LDMs), since guidance is applied in the latent space of the LDM. Experiments on synthetic and real-world benchmarks demonstrate that MMD Guidance can achieve distributional alignment while preserving sample fidelity.

Why we are recommending this paper?
Due to your Interest in Diffusion Models

This paper directly addresses diffusion models, a core interest for the user, focusing on adaptation techniques. The use of Maximum Mean Discrepancy guidance is a key area of research within diffusion model training, aligning with the user’s deep learning interests.

MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

National University of Singapore

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The paper discusses various advancements in natural language processing (NLP) and speech recognition. [3]
It highlights the development of large-scale models such as the LLaMA 3 herd of models and textually pretrained speech language models. [3]
The paper also covers advancements in speech recognition, including the development of large-scale ASR corpora like LibriHeavy and the use of conditional computation and automatic sharding to scale giant models. [3]
Researchers have been working on improving the efficiency and accuracy of NLP tasks, including generative spoken language modeling from raw audio and mixture-of-experts models. [2]
Gshard: a scaling giant models framework using conditional computation and automatic sharding. [1]

Abstract
We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post-training on ASR and TTS datasets, followed by fine-tuning with a carefully curated speech-text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open-source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality-specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains. To our knowledge, MoST represents the first fully open-source speech-text LLM built on a Mixture of Experts architecture. \footnote{We release MoST model, training code, inference code, and training data at https://github.com/NUS-HPC-AI-Lab/MoST

Why we are recommending this paper?
Due to your Interest in Mixture of Experts

The paper’s focus on Mixture of Experts (MoE) models, combined with multimodal learning, strongly aligns with the user’s stated interests in this area. Exploring MoST offers a direct path to understanding a cutting-edge approach to combining speech and text processing.

The Imperfective Paradox in Large Language Models

LMU Munich

Rate paper: 👍 👎 ♥ Save

AI Insights

Telicity: A property of verbs that indicates whether an action is completed or not. [3]
The study relies heavily on existing research and does not provide any novel insights or contributions. [3]
The paper discusses the limitations of large language models (LLMs) in understanding telicity and aspectual class classification. [2]
The authors propose a new approach using semantic entropy to detect hallucinations in LLMs, which can be used for improving their performance on telicity classification. [1]

Abstract
Do Large Language Models (LLMs) genuinely grasp the compositional semantics of events, or do they rely on surface-level probabilistic heuristics? We investigate the Imperfective Paradox, a logical phenomenon where the past progressive aspect entails event realization for activities (e.g., running $\to$ ran) but not for accomplishments (e.g., building $\nrightarrow$ built). We introduce ImperfectiveNLI, a diagnostic dataset designed to probe this distinction across diverse semantic classes. Evaluating state-of-the-art open-weight models, we uncover a pervasive Teleological Bias: models systematically hallucinate completion for goal-oriented events, often overriding explicit textual negation. Representational analyses show that while internal embeddings often distinguish process from result, inference decisions are dominated by strong priors about goal attainment. We further find that prompting-based interventions reduce hallucinated completions but also increase incorrect rejections of valid entailments. Our findings suggest that current LLMs lack structural aspectual awareness, operating as predictive narrative engines rather than faithful logical reasoners.

Why we are recommending this paper?
Due to your Interest in Large Language Models

This paper investigates Large Language Models (LLMs), a central interest for the user, specifically examining their semantic understanding. The 'Imperfective Paradox' provides a compelling theoretical lens through which to analyze LLM behavior.

Are Language Models Models?

University of Maryland

Rate paper: 👍 👎 ♥ Save

AI Insights

The argument from amazingness, which suggests that language models must be operating in human-like ways because they perform well on certain tasks, is unreliable and unnecessary for their value in computational cognitive modeling. [3]
Algorithmic/representational level: concerns the mechanisms and representations used by a system to solve problems. [3]
Computational theory level: examines the underlying principles and algorithms that govern a system's behavior. [3]
Language models are not suitable as model systems at any of Marr's three levels: implementation, algorithmic/representational, and computational theory. [2]

Abstract
Futrell and Mahowald claim LMs "serve as model systems", but an assessment at each of Marr's three levels suggests the claim is clearly not true at the implementation level, poorly motivated at the algorithmic-representational level, and problematic at the computational theory level. LMs are good candidates as tools; calling them cognitive models overstates the case and unnecessarily feeds LLM hype.

Why we are recommending this paper?
Due to your Interest in Large Language Models

This paper directly tackles the fundamental question of whether LLMs truly ‘model’ the world, a critical area of debate within the field. The exploration of Marr’s levels of analysis provides a structured approach to evaluating LLM capabilities, aligning with the user’s deep learning interests.

Optimising for Energy Efficiency and Performance in Machine Learning

University of Cambridge

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

They use the MOBO (Multi-fidelity Bayesian Optimization) algorithm to search for optimal hyperparameters. [3]
MOBO: Multi-fidelity Bayesian Optimization CNN: Convolutional Neural Network CIFAR-10: A dataset of images for image classification SOTA: State-of-the-Art MAC: Multiply-accumulate operation [3]
The authors present an approach to optimizing machine learning models for both performance and energy efficiency. [2]
However, the energy consumption of the optimized model is only 0.39 mJ, making it more energy-efficient than the state-of-the-art Spike Aggregation Transformer (SAFormer). [1]

Abstract
The ubiquity of machine learning (ML) and the demand for ever-larger models bring an increase in energy consumption and environmental impact. However, little is known about the energy scaling laws in ML, and existing research focuses on training cost -- ignoring the larger cost of inference. Furthermore, tools for measuring the energy consumption of ML do not provide actionable feedback. To address these gaps, we developed Energy Consumption Optimiser (ECOpt): a hyperparameter tuner that optimises for energy efficiency and model performance. ECOpt quantifies the trade-off between these metrics as an interpretable Pareto frontier. This enables ML practitioners to make informed decisions about energy cost and environmental impact, while maximising the benefit of their models and complying with new regulations. Using ECOpt, we show that parameter and floating-point operation counts can be unreliable proxies for energy consumption, and observe that the energy efficiency of Transformer models for text generation is relatively consistent across hardware. These findings motivate measuring and publishing the energy metrics of ML models. We further show that ECOpt can have a net positive environmental impact and use it to uncover seven models for CIFAR-10 that improve upon the state of the art, when considering accuracy and energy efficiency together.

Why we are recommending this paper?
Due to your Interest in Deep Learning Optimization

Given the user’s interest in Deep Learning Optimization, this paper’s focus on the energy scaling laws of ML is highly relevant. Understanding the efficiency of ML models is a crucial aspect of modern deep learning research.

Quantitative weak propagation of chaos for McKean--Vlasov branching diffusion processes

The Chinese University of Hong Kong

Rate paper: 👍 👎 ♥ Save

AI Insights

The solution involves using mathematical equations and techniques to prove that there is a unique way for this process to change, which is important for understanding many real-world phenomena. [3]
The solution involves proving the existence and uniqueness of the environment measures (μ_t) for t ∈ [0, T]. [2]
The problem is about finding the solution to a nonlinear Fokker-Planck equation associated with a branching diffusion process. [1]

Abstract
We study in this paper the weak propagation of chaos for McKean--Vlasov diffusions with branching, whose induced marginal measures are nonnegative finite measures but not necessary probability measures. The flow of marginal measures satisfies a non-linear Fokker--Planck equation, along which we provide a functional Itô's formula. We then consider a functional of the terminal marginal measure of the branching process, whose conditional value is solution to a Kolmogorov backward master equation. By using Itô's formula and based on the estimates of second-order linear and intrinsic functional derivatives of the value function, we finally derive a quantitative weak convergence rate for the empirical measures of the branching diffusion processes with finite population.

Why we are recommending this paper?
Due to your Interest in Diffusion Models

Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints

SB Intuitions, Tokyo, Japan

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The authors use a scaling law to model the relationship between the number of parameters in the network and its performance, and then use this law to optimize the architecture of the network. [3]
The proposed method is applied to a specific type of neural network called the Mixture-of-Experts (MoE) architecture, which is designed for large-scale natural language processing tasks. [3]
Mixture-of-Experts (MoE): A type of neural network that consists of multiple experts, each of which is responsible for a specific subset of the input data. [3]
Scaling law: A mathematical model that describes the relationship between the number of parameters in a neural network and its performance. [3]
The proposed method can be used to optimize the architecture of large-scale neural networks, leading to improved performance and reduced computational requirements. [3]
The proposed method may not be applicable to all types of neural networks, and further research is needed to determine its limitations and generalizability. [3]
The paper proposes a method to optimize the architecture of large-scale neural networks using a combination of mathematical modeling and optimization techniques. [2]

Abstract
Modern Mixture-of-Experts (MoE) language models are designed based on total parameters (memory footprint) and active parameters (inference cost). However, we find these two factors alone are insufficient to describe an optimal architecture. Through a systematic study, we demonstrate that MoE performance is primarily determined by total parameters ($N_{total}$) and expert sparsity ($s:=n_{exp}/n_{topk}$). Moreover, $n_{exp}$ and $n_{topk}$ do not "cancel out" within the sparsity ratio; instead, a larger total number of experts slightly penalizes performance by forcing a reduction in core model dimensions (depth and width) to meet memory constraints. This motivates a simple principle for MoE design which maximizes $N_{total}$ while minimizing $s$ (maximizing $n_{topk}$) and $n_{exp}$ under the given constraints. Our findings provide a robust framework for resolving architectural ambiguity and guiding MoE design.

Why we are recommending this paper?
Due to your Interest in Mixture of Experts

Convergence of gradient flow for learning convolutional neural networks

LudwigMaximiliansUniversitt

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Linear convolutional neural network (CNN): A type of neural network where each layer consists of a set of filters that slide over the input data, performing a convolution operation. [3]
Gradient flow: The process of iteratively updating the parameters of a function to minimize its value. [3]
The paper discusses the convergence of gradient flows for linear convolutional neural networks (CNNs). [2]
The authors use a Riemannian geometric approach to analyze the convergence of these flows. [1]

Abstract
Convolutional neural networks are widely used in imaging and image recognition. Learning such networks from training data leads to the minimization of a non-convex function. This makes the analysis of standard optimization methods such as variants of (stochastic) gradient descent challenging. In this article we study the simplified setting of linear convolutional networks. We show that the gradient flow (to be interpreted as an abstraction of gradient descent) applied to the empirical risk defined via certain loss functions including the square loss always converges to a critical point, under a mild condition on the training data.

Why we are recommending this paper?
Due to your Interest in Deep Learning Models

Generalization Analysis and Method for Domain Generalization for a Family of Recurrent Neural Networks

University of Manitoba

Rate paper: 👍 👎 ♥ Save

AI Insights

The authors use a real dataset provided by OpenEI to train a deep LSTM architecture and evaluate its performance under domain shift. [3]
They compare their results with those obtained using a different generalization bound for DNNs. [3]
Interpretable model: A linear dynamical system that approximates the hidden-state evolution of a deep learning model. [3]
Domain-generalization method: An approach to improve the performance of a deep learning model under domain shift by modifying its parameters. [3]
OOD generalization error: The difference between the model's performance on out-of-distribution data and its performance on in-distribution data. [3]
The proposed generalization analysis and domain-generalization method are effective in improving the performance of deep learning models under domain shift. [3]
The interpretable model provides a useful tool for understanding the behavior of complex deep learning architectures. [3]
The results demonstrate the importance of considering domain shift when evaluating the performance of deep learning models. [3]
The paper presents a generalization analysis and a domain-generalization method for deep learning models, specifically LSTM networks, in the context of load forecasting. [2]
The proposed approach is based on an interpretable model that approximates the hidden-state evolution using linear dynamical systems. [1]

Abstract
Deep learning (DL) has driven broad advances across scientific and engineering domains. Despite its success, DL models often exhibit limited interpretability and generalization, which can undermine trust, especially in safety-critical deployments. As a result, there is growing interest in (i) analyzing interpretability and generalization and (ii) developing models that perform robustly under data distributions different from those seen during training (i.e. domain generalization). However, the theoretical analysis of DL remains incomplete. For example, many generalization analyses assume independent samples, which is violated in sequential data with temporal correlations. Motivated by these limitations, this paper proposes a method to analyze interpretability and out-of-domain (OOD) generalization for a family of recurrent neural networks (RNNs). Specifically, the evolution of a trained RNN's states is modeled as an unknown, discrete-time, nonlinear closed-loop feedback system. Using Koopman operator theory, these nonlinear dynamics are approximated with a linear operator, enabling interpretability. Spectral analysis is then used to quantify the worst-case impact of domain shifts on the generalization error. Building on this analysis, a domain generalization method is proposed that reduces the OOD generalization error and improves the robustness to distribution shifts. Finally, the proposed analysis and domain generalization approach are validated on practical temporal pattern-learning tasks.

Why we are recommending this paper?
Due to your Interest in Deep Learning Models

Combinatorial Optimization Augmented Machine Learning

Technical University of Munich

Rate paper: 👍 👎 ♥ Save

AI Insights

Statistical model φw: A neural network that takes the input instance x as input and outputs parameters θ that interact with the CO-oracle. [3]
COAML is a framework that combines machine learning and optimization techniques to find the best solution. [3]
It's like having a super-smart assistant that can learn from data and make predictions to help you solve the problem efficiently. [3]
The paper discusses a framework called COAML (Combinatorial Optimization and Machine Learning) that integrates machine learning with combinatorial optimization problems. [2]

Abstract
Combinatorial optimization augmented machine learning (COAML) has recently emerged as a powerful paradigm for integrating predictive models with combinatorial decision-making. By embedding combinatorial optimization oracles into learning pipelines, COAML enables the construction of policies that are both data-driven and feasibility-preserving, bridging the traditions of machine learning, operations research, and stochastic optimization. This paper provides a comprehensive overview of the state of the art in COAML. We introduce a unifying framework for COAML pipelines, describe their methodological building blocks, and formalize their connection to empirical cost minimization. We then develop a taxonomy of problem settings based on the form of uncertainty and decision structure. Using this taxonomy, we review algorithmic approaches for static and dynamic problems, survey applications across domains such as scheduling, vehicle routing, stochastic programming, and reinforcement learning, and synthesize methodological contributions in terms of empirical cost minimization, imitation learning, and reinforcement learning. Finally, we identify key research frontiers. This survey aims to serve both as a tutorial introduction to the field and as a roadmap for future research at the interface of combinatorial optimization and machine learning.

Why we are recommending this paper?
Due to your Interest in Deep Learning Optimization

UM-Text: A Unified Multimodal Model for Image Understanding

JDCOM

Rate paper: 👍 👎 ♥ Save

AI Insights

The paper proposes a new framework called AnyText2, which is capable of generating and editing visual text with customizable attributes. [2]
The paper does not provide a clear explanation of how the framework handles out-of-vocabulary words. [1]

Abstract
With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering the stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing by natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be elaborately designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by VLM according to the input instruction. During training, we propose a regional consistency loss to offer more effective supervision for glyph generation on both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute the UM-DATA-200K, a large-scale visual text image dataset on diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.

Why we are recommending this paper?
Due to your Interest in Multimodal Learning

Personalized Multimodal Feedback Using Multiple External Representations: Strategy Profiles and Learning in High School Physics

LudwigMaximiliansUniversitt Mnchen LMU Munich

Rate paper: 👍 👎 ♥ Save

AI Insights

The study found a small but significant positive association between the frequency with which students accessed elaborated feedback and their post-test scores in both cohorts. [3]
The time-resolved MER-patterns analysis did not reveal distinct time-resolved MER-patterns, as most students tended to either select no feedback or all three feedback types simultaneously. [3]
Time-aggregated clusters: Clusters of students that are grouped based on their overall frequency of feedback selection over time. [3]
The analysis revealed distinct patterns of feedback selection among students, which can inform the design of feedback systems to support student learning. [3]
The observational design of the study means that it is not possible to infer causality between the frequency of use of elaborated feedback and learning gains. [3]
The analysis relied on self-reported data from students, which may be subject to biases and errors. [3]
The analysis revealed three main MER-selection-strategies regarding elaborated feedback use: consistently selecting all three MERs for every feedback consultation, evenly consulting the three MERs during the use of the platform but not simultaneously, or giving more weight to verbal feedback over the other types. [2]
MER-patterns: The patterns of feedback selection made by students during their use of the platform. [1]

Abstract
Multiple external representations (MERs) and personalized feedback support physics learning, yet evidence on how personalized feedback can effectively integrate MERs remains limited. This question is particularly timely given the emergence of multimodal large language models. We conducted a 16-24 week observational study in high school physics (N=661) using a computer-based platform that provided verification and optional elaborated feedback in verbal, graphical and mathematical forms. Linear mixed-effects models and strategy-cluster analyses (ANCOVA-adjusted comparisons) tested associations between feedback use and post-test performance and moderation by representational competence. Elaborated multirepresentational feedback showed a small but consistent positive association with post-test scores independent of prior knowledge and confidence. Learners adopted distinct representation-selection strategies; among students with lower representational competence, using a diverse set of representations related to higher learning, whereas this advantage diminished as competence increased. These findings motivate adaptive feedback designs and inform intelligent tutoring systems capable of tailoring feedback elaboration and representational format to learner profiles, advancing personalized instruction in physics education.

Why we are recommending this paper?
Due to your Interest in Multimodal Learning

DR-Arena: an Automated Evaluation Framework for Deep Research Agents

National University of Singapore

Rate paper: 👍 👎 ♥ Save

AI Insights

The document describes a system called DR-Arena, which is designed to evaluate the performance of search agents. [3]
The system generates complex research tasks and evaluates the responses from two search agents based on their accuracy, comprehensiveness, formatting, and helpfulness. [3]
DR-Arena is a system that tests search agents' abilities by giving them complex research tasks. [3]
The system evaluates the answers from two search agents based on how accurate, comprehensive, and helpful they are. [3]
DR-Arena The document does not provide information about how the system handles cases where both search agents fail to find the correct entity. [2]

Abstract
As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination. To address these, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges. Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents the state-of-the-art alignment with human preferences without any manual efforts, validating DR-Arena as a reliable alternative for costly human adjudication.

Why we are recommending this paper?
Due to your Interest in Deep Learning

Deep Exploration of Epoch-wise Double Descent in Noisy Data: Signal Separation, Large Activation, and Benign Overfitting

Niigata University

Rate paper: 👍 👎 ♥ Save

AI Insights

The double descent phenomenon in machine learning refers to the observation that as the model size and training data increase, the generalization error of a model can first decrease and then increase again. [3]
This phenomenon has been observed in various contexts, including linear regression, neural networks, and graph convolutional networks. [3]
The double descent curve is characterized by three phases: underfitting, overfitting, and double descent. [3]
In the underfitting phase, the model is too simple to capture the underlying patterns in the data, leading to poor generalization performance. [3]
In the overfitting phase, the model is too complex and captures noise in the data, also leading to poor generalization performance. [3]
The double descent phase occurs when the model size increases beyond a certain point, causing the model to start capturing the underlying patterns in the data again, but with increased capacity for overfitting. [3]
Some of these explanations include the bias-variance trade-off, the effect of noise on fitting linear regression models, and the role of regularization in mitigating double descent. [3]
Researchers have also proposed various methods to mitigate or understand the double descent phenomenon, including optimal regularization, early stopping, and multi-scale feature learning dynamics. [3]
These methods aim to balance the capacity of the model with its ability to generalize well to new data. [3]
The study of double descent has significant implications for machine learning research and practice. [3]
It highlights the importance of understanding the trade-offs between model complexity and generalization performance, and provides insights into how to design models that can generalize well to new data. [3]
However, the existing literature provides valuable insights into this phenomenon and highlights the importance of continued investigation in this area. [3]
Double Descent: A phenomenon where as the model size and training data increase, the generalization error of a model can first decrease and then increase again. [3]
Underfitting: When a model is too simple to capture the underlying patterns in the data, leading to poor generalization performance. [3]
Overfitting: When a model is too complex and captures noise in the data, also leading to poor generalization performance. [3]
Double Descent Curve: A curve that characterizes the three phases of the double descent phenomenon: underfitting, overfitting, and double descent. [3]
The double descent phenomenon has been studied extensively in recent years, and various explanations have been proposed. [1]

Abstract
Deep double descent is one of the key phenomena underlying the generalization capability of deep learning models. In this study, epoch-wise double descent, which is delayed generalization following overfitting, was empirically investigated by focusing on the evolution of internal structures. Fully connected neural networks of three different sizes were trained on the CIFAR-10 dataset with 30% label noise. By decomposing the loss curves into signal contributions from clean and noisy training data, the epoch-wise evolutions of internal signals were analyzed separately. Three main findings were obtained from this analysis. First, the model achieved strong re-generalization on test data even after perfectly fitting noisy training data during the double descent phase, corresponding to a "benign overfitting" state. Second, noisy data were learned after clean data, and as learning progressed, their corresponding internal activations became increasingly separated in outer layers; this enabled the model to overfit only noisy data. Third, a single, very large activation emerged in the shallow layer across all models; this phenomenon is referred as "outliers," "massive activa-tions," and "super activations" in recent large language models and evolves with re-generalization. The magnitude of large activation correlated with input patterns but not with output patterns. These empirical findings directly link the recent key phenomena of "deep double descent," "benign overfitting," and "large activation", and support the proposal of a novel scenario for understanding deep double descent.

Why we are recommending this paper?
Due to your Interest in Deep Learning

Enhancing LUT-based Deep Neural Networks Inference through Architecture and Connectivity Optimization

The University of Sydney

Rate paper: 👍 👎 ♥ Save

AI Insights

The provided text appears to be a research paper or dissertation on the topic of neural network pruning and re-growth, specifically focusing on the PolyLUT (Polynomial Lookup Table) architecture. [3]
The paper discusses various techniques for pruning and re-growing neural networks, including random connectivity, polynomial expansion, and truth table-based methods. [3]
Random connectivity: A method for generating random connections between neurons in a neural network. [3]
Polynomial expansion: A technique for expanding the number of weights in a neural network using polynomial functions. [3]
The paper presents experimental results demonstrating the effectiveness of the proposed methods in reducing computational complexity while maintaining accuracy. [2]
The authors propose a novel approach called PolyLUT-Add, which combines the benefits of both PolyLUT and adder-based architectures. [1]

Abstract
Deploying deep neural networks (DNNs) on resource-constrained edge devices such as FPGAs requires a careful balance among latency, power, and hardware resource usage, while maintaining high accuracy. Existing Lookup Table (LUT)-based DNNs -- such as LogicNets, PolyLUT, and NeuraLUT -- face two critical challenges: the exponential growth of LUT size and inefficient random sparse connectivity. This paper presents SparseLUT, a comprehensive framework that addresses these challenges through two orthogonal optimizations. First, we propose an architectural enhancement that aggregates multiple PolyLUT sub-neurons via an adder, significantly reducing LUT consumption by 2.0x-13.9x and lowering inference latency by 1.2x-1.6x, all while maintaining comparable accuracy. Building upon this foundation, we further introduce a non-greedy training algorithm that optimizes neuron connectivity by selectively pruning less significant inputs and strategically regrowing more effective ones. This training optimization, which incurs no additional area and latency overhead, delivers consistent accuracy improvements across benchmarks -- achieving up to a 2.13% gain on MNIST and 0.94% on Jet Substructure Classification compared to existing LUT-DNN approaches.

Why we are recommending this paper?
Due to your Interest in Deep Learning Architectures

💬 Help Shape Our Pricing

We're exploring pricing options to make this project sustainable. Take 3 minutes to share what you'd be willing to pay (if anything). Your input guides our future investment.

Share Your Feedback

Help us improve your experience!

This project is on its early stages your feedback can be pivotal on the future of the project. Let us know what you think about this week's papers and suggestions!

Give Feedback