Hi!
Your personalized paper recommendations for 19 to 23 January 2026.
University of Virginia
AI Insights
- Editing precision ratio (EPR): A metric used to evaluate the effectiveness of semantic editing methods, measuring the ratio of correctly edited attributes to the total number of attributes. (ML: 0.97)
- Concept alignment: The process of aligning the sparse latent representation with the target concept, allowing for controlled semantic modification. (ML: 0.97)
- Learned perceptual image patch similarity (LPIPS): A metric used to evaluate the perceptual distortion introduced by semantic editing methods. (ML: 0.97)
- A concept-mapping linear layer is used to align the sparse latent representation with the target concept, allowing for controlled semantic modification. (ML: 0.96)
- The paper proposes a method for semantic editing using sparse autoencoders and concept alignment. (ML: 0.95)
- The method is evaluated on several datasets and shows state-of-the-art results in terms of editing precision ratio (EPR) and learned perceptual image patch similarity (LPIPS). (ML: 0.94)
- The proposed method achieves state-of-the-art results in terms of EPR and LPIPS, demonstrating its effectiveness for semantic editing tasks. (ML: 0.93)
- Sparse autoencoder: A neural network that learns to represent input data using a sparse latent representation. (ML: 0.93)
- The use of sparse autoencoders and concept alignment enables controlled and interpretable semantic modification, allowing for a range of applications in image editing and generation. (ML: 0.92)
- The method involves training a sparse autoencoder on the bottleneck activations of a U-Net denoiser, which captures high-level semantics in the image. (ML: 0.81)
Abstract
Internal activations of diffusion models encode rich semantic information, but interpreting such representations remains challenging. While Sparse Autoencoders (SAEs) have shown promise in disentangling latent representations, existing SAE-based methods for diffusion model understanding rely on unsupervised approaches that fail to align sparse features with human-understandable concepts. This limits their ability to provide reliable semantic control over generated images. We introduce CASL (Concept-Aligned Sparse Latents), a supervised framework that aligns sparse latent dimensions of diffusion models with semantic concepts. CASL first trains an SAE on frozen U-Net activations to obtain disentangled latent representations, and then learns a lightweight linear mapping that associates each concept with a small set of relevant latent dimensions. To validate the semantic meaning of these aligned directions, we propose CASL-Steer, a controlled latent intervention that shifts activations along the learned concept axis. Unlike editing methods, CASL-Steer is used solely as a causal probe to reveal how concept-aligned latents influence generated content. We further introduce the Editing Precision Ratio (EPR), a metric that jointly measures concept specificity and the preservation of unrelated attributes. Experiments show that our method achieves superior editing precision and interpretability compared to existing approaches. To the best of our knowledge, this is the first work to achieve supervised alignment between latent representations and semantic concepts in diffusion models.
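The Editing Precision Ratio above is described only informally in this digest, so here is a hypothetical attribute-level sketch of such a score: the fraction of intended edits applied plus unrelated attributes preserved, over all attributes. The equal weighting and dictionary representation are assumptions, not the paper's formal definition.

```python
# Hypothetical EPR-style score: reward correctly edited target attributes
# and preserved unrelated attributes; the paper's exact formula may differ.

def epr(before: dict, after: dict, targets: dict) -> float:
    """before/after: attribute -> value; targets: intended edits."""
    correct = sum(1 for k, v in targets.items() if after.get(k) == v)
    unrelated = [k for k in before if k not in targets]
    preserved = sum(1 for k in unrelated if after.get(k) == before[k])
    total = len(targets) + len(unrelated)
    return (correct + preserved) / total if total else 0.0

before = {"hair": "brown", "glasses": False, "smile": False}
after = {"hair": "blond", "glasses": False, "smile": True}  # smile changed unintentionally
print(epr(before, after, targets={"hair": "blond"}))  # 2/3: one unintended change
```

An edit that flips only the targeted attribute scores 1.0; collateral changes pull the score down, which is the "concept specificity plus preservation" trade-off the abstract describes.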
Why are we recommending this paper?
Due to your interest in Diffusion Models
This paper directly addresses the challenge of interpreting diffusion models, aligning with your interest in understanding deep learning models. The use of sparse autoencoders to disentangle latent representations is a key technique within the broader domain of deep learning architectures.
CISPA Helmholtz Center for Information Security
AI Insights
- The paper also discusses the implications of their findings on the design of MoE models. (ML: 0.94)
- They derive the Bayes optimal estimators for both cases and provide a proof of Theorem 4.2. (ML: 0.89)
- The authors analyze the generalization error of MoE under different scenarios, including dense and sparse cases. (ML: 0.87)
- The paper discusses the robustness of Mixtures of Experts (MoE) to feature noise. (ML: 0.87)
- MoE is a type of neural network architecture that combines multiple experts to make predictions. (ML: 0.85)
Abstract
Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an iso-parameter regime where inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence speed. Empirical results on synthetic data and real-world language tasks corroborate the theoretical insights, demonstrating consistent robustness and efficiency gains from sparse modular computation.
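To make "sparse expert activation" concrete, here is a minimal top-k routing sketch: a gate scores each expert, only the k best run, and their outputs are combined with renormalized gate weights. This is a generic illustration of MoE computation, not the paper's estimator; all shapes and names are illustrative.

```python
# Minimal sketch of sparse top-k expert routing (illustrative, not the
# paper's model): score experts, run only the top-k, renormalize weights.
import numpy as np

def moe_forward(x, gate_w, experts_w, k=1):
    """x: (d,), gate_w: (n_experts, d), experts_w: (n_experts, d, d)."""
    scores = gate_w @ x                      # one score per expert
    top = np.argsort(scores)[-k:]            # indices of the k best experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                             # softmax over selected experts only
    return sum(wi * (experts_w[i] @ x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n = 4, 3
y = moe_forward(rng.normal(size=d), rng.normal(size=(n, d)),
                rng.normal(size=(n, d, d)), k=2)
print(y.shape)  # (4,)
```

The paper's claim is that restricting computation to the selected experts filters feature noise that a dense mixture would propagate.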
Why are we recommending this paper?
Due to your interest in Mixture of Experts
Given your interest in Mixture of Experts models, this paper offers valuable insights into their robustness, a critical area for practical deployment. The focus on noisy internal activations is particularly relevant to understanding MoE behavior and aligns with your interest in deep learning models.
Uppsala University
AI Insights
- SoftMoE: A variant of MoE models that uses a soft-gating mechanism to select the most relevant experts for each input. (ML: 0.95)
- MoE models can generalize robustly in moderate-scale vision tasks when appropriately regularized. (ML: 0.93)
- Mixture-of-Experts (MoE) models: A type of neural network architecture that combines multiple experts to make predictions. (ML: 0.92)
- SparseMoE: A variant of MoE models that uses a sparse-gating mechanism to select only a subset of experts for each input. (ML: 0.92)
- SoftMoE and SparseMoE architectures outperform the dense baseline on validation accuracy when expert utilization is properly regularized. (ML: 0.92)
- Hessian-based curvature analysis: A method used to analyze the geometry of the loss surface in neural networks. (ML: 0.85)
- The gap between theoretical and realized efficiency in sparse MoE models arises from the overhead of routing, selection, and aggregation operations in naive implementations. (ML: 0.74)
- Hessian-based curvature analysis reveals that SoftMoE converges to solutions with higher local curvature, while Dense and SparseMoE occupy a similar sharpness regime. (ML: 0.74)
Abstract
Mixture-of-Experts (MoE) architectures enable conditional computation by routing inputs to multiple expert subnetworks and are often motivated as a mechanism for scaling large language models. In this project, we instead study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization. We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 dataset under comparable model capacity. Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization through regularization, avoiding expert collapse. To analyze generalization, we compute Hessian-based sharpness metrics at convergence, including the largest eigenvalue and trace of the loss Hessian, evaluated on both training and test data. We find that SoftMoE exhibits higher sharpness by these metrics, while Dense and SparseMoE lie in a similar curvature regime, despite all models achieving comparable generalization performance. Complementary loss surface perturbation analyses reveal qualitative differences in non-local behavior under finite parameter perturbations between dense and MoE models, which help contextualize curvature-based measurements without directly explaining validation accuracy. We further evaluate empirical inference efficiency and show that naively implemented conditional routing does not yield inference speedups on modern hardware at this scale, highlighting the gap between theoretical and realized efficiency in sparse MoE models.
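The sharpness metric used above, the largest eigenvalue of the loss Hessian, can be estimated without forming the Hessian via power iteration on Hessian-vector products. The sketch below uses a toy quadratic loss where the answer is known; it illustrates the metric, not the paper's exact measurement pipeline.

```python
# Largest-Hessian-eigenvalue sharpness via power iteration on
# Hessian-vector products. For L(w) = 0.5 * w^T A w the Hessian is A,
# so the estimate is checkable against A's top eigenvalue.
import numpy as np

def hvp(grad_fn, w, v, eps=1e-5):
    """Finite-difference Hessian-vector product."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

def top_eigenvalue(grad_fn, w, iters=100, seed=0):
    v = np.random.default_rng(seed).normal(size=w.shape)
    for _ in range(iters):
        v = hvp(grad_fn, w, v)
        v /= np.linalg.norm(v)
    return v @ hvp(grad_fn, w, v)  # Rayleigh quotient, approx. lambda_max

A = np.diag([3.0, 1.0, 0.5])
grad = lambda w: A @ w            # gradient of 0.5 * w^T A w
print(top_eigenvalue(grad, np.zeros(3)))  # close to 3.0
```

In practice the same loop runs with autodiff Hessian-vector products over the training or test loss, which is how "largest eigenvalue and trace" sharpness numbers like those in the abstract are typically obtained.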
Why are we recommending this paper?
Due to your interest in Mixture of Experts
This paper investigates MoE models in a vision setting, directly addressing a significant area of interest for you. Its focus on routing and optimization within MoE architectures is a core component of your interest in Mixture of Experts.
McGill
AI Insights
- More research is needed to fully understand the potential and limitations of LLMs in this context. (ML: 0.98)
- Different methods have been proposed to leverage the capabilities of LLMs for optimization, but more research is needed to fully understand their potential and limitations. (ML: 0.91)
- The paper discusses how large language models can be used in offline model-based optimization. (ML: 0.91)
- The paper discusses various methods for using large language models (LLMs) in offline model-based optimization. (ML: 0.90)
- The paper cites several other papers that have explored different approaches to using LLMs for optimization, but more work needs to be done to fully understand their potential and limitations. (ML: 0.89)
- Several papers are cited that explore different approaches to leveraging LLMs for optimization, including importance-aware co-teaching and guided trajectory generation with diffusion models. (ML: 0.88)
- This is a new area of research where people are trying to figure out how to use these powerful models for optimization tasks. (ML: 0.83)
- The use of LLMs in offline model-based optimization is a rapidly evolving field. (ML: 0.79)
Abstract
Offline black-box optimization (BBO) aims to find optimal designs based solely on an offline dataset of designs and their labels. Such scenarios frequently arise in domains like DNA sequence design and robotics, where only a few labeled data points are available. Traditional methods typically rely on task-specific proxy or generative models, overlooking the in-context learning capabilities of pre-trained large language models (LLMs). Recent efforts have adapted autoregressive LLMs to BBO by framing task descriptions and offline datasets as natural language prompts, enabling direct design generation. However, these designs often contain bidirectional dependencies, which left-to-right models struggle to capture. In this paper, we explore diffusion LLMs for BBO, leveraging their bidirectional modeling and iterative refinement capabilities. This motivates our in-context denoising module: we condition the diffusion LLM on the task description and the offline dataset, both formatted in natural language, and prompt it to denoise masked designs into improved candidates. To guide the generation toward high-performing designs, we introduce masked diffusion tree search, which casts the denoising process as a step-wise Monte Carlo Tree Search that dynamically balances exploration and exploitation. Each node represents a partially masked design, each denoising step is an action, and candidates are evaluated via expected improvement under a Gaussian Process trained on the offline dataset. Our method, dLLM, achieves state-of-the-art results in few-shot settings on design-bench.
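The abstract says candidates are ranked "via expected improvement under a Gaussian Process". Given a GP posterior mean and standard deviation at a candidate, expected improvement has a standard closed form, sketched below; how the GP itself is fit to the offline dataset is the paper's business and not reproduced here.

```python
# Closed-form expected improvement (EI) for a maximization problem:
# given GP posterior mean mu and std sigma at a candidate design,
# EI = (mu - best) * Phi(z) + sigma * phi(z), z = (mu - best) / sigma.
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    if sigma <= 0:
        return max(0.0, mu - best - xi)       # no uncertainty: plain improvement
    z = (mu - best - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mu - best - xi) * cdf + sigma * pdf

# A candidate predicted above the incumbent outscores one predicted at it.
print(expected_improvement(1.5, 0.2, best=1.0) >
      expected_improvement(1.0, 0.2, best=1.0))  # True
```

In the tree search described above, each partially masked design would be scored this way, so exploration (high sigma) and exploitation (high mu) both contribute.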
Why are we recommending this paper?
Due to your interest in Large Language Models
This work applies diffusion large language models to black-box optimization, a technique relevant to your interest in deep learning optimization. The use of diffusion models for designing solutions aligns with your broader interest in deep learning models.
University of Science and Technology of China
AI Insights
- The paper presents a new method for training deep generative models using the power-uniform time discretization strategy. (ML: 0.86)
- The paper also presents experimental results demonstrating the effectiveness of the proposed method in various settings, including image synthesis and data imputation. (ML: 0.86)
- The proposed method is based on a novel formulation of the reverse sampling process that leverages the power-uniform discretization to improve the stability and efficiency of the training procedure. (ML: 0.81)
- Key terms: power-uniform time discretization strategy; reverse sampling process; gamma (hyperparameter); KL divergence. The power-uniform time discretization strategy is a novel approach to training deep generative models that offers improved stability and efficiency. (ML: 0.77)
- The proposed method leverages the power-uniform discretization to improve the reverse sampling process, leading to faster convergence and better performance. (ML: 0.69)
- The authors provide a theoretical analysis of the method, including a proof of convergence and an expression for the optimal hyperparameter gamma. (ML: 0.60)
Abstract
An elementary approach to characterizing the impact of noise scheduling and time discretization in generative diffusion models is developed. Considering a simplified model where the source distribution is multivariate Gaussian with a given covariance matrix, the explicit closed-form evolution trajectory of the distributions across reverse sampling steps is derived, and consequently, the Kullback-Leibler (KL) divergence between the source distribution and the reverse sampling output is obtained. The effect of the number of time discretization steps on the convergence of this KL divergence is studied via the Euler-Maclaurin expansion. An optimization problem is formulated, and its solution noise schedule is obtained via calculus of variations, shown to follow a tangent law whose coefficient is determined by the eigenvalues of the source covariance matrix. For an alternative scenario, more realistic in practice, where pretrained models have been obtained for some given noise schedules, the KL divergence also provides a measure to compare different time discretization strategies in reverse sampling. Experiments across different datasets and pretrained models demonstrate that the time discretization strategy selected by our approach consistently outperforms baseline and search-based strategies, particularly when the budget on the number of function evaluations is very tight.
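The central quantity in this analysis is a KL divergence between Gaussians, which is available in closed form. As a minimal sketch, here is the standard formula for two zero-mean Gaussians with diagonal covariances; the paper works with general covariance matrices and tracks this quantity across reverse sampling steps, which this snippet does not attempt.

```python
# Closed-form KL divergence between zero-mean Gaussians with diagonal
# covariances p = N(0, diag(var_p)), q = N(0, diag(var_q)):
# KL(p||q) = 0.5 * sum(var_p/var_q - 1 + log(var_q/var_p)).
import math

def kl_diag_gaussians(var_p, var_q):
    return 0.5 * sum(vp / vq - 1 + math.log(vq / vp)
                     for vp, vq in zip(var_p, var_q))

print(kl_diag_gaussians([1.0, 2.0], [1.0, 2.0]))  # 0.0 for identical distributions
print(kl_diag_gaussians([1.0], [2.0]) > 0)        # True: KL is nonnegative
```

Having this in closed form is what lets the paper study the divergence analytically as a function of the number of discretization steps.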
Why are we recommending this paper?
Due to your interest in Diffusion Models
This paper tackles the fundamental problem of noise scheduling in diffusion models, a core element of your interest in generative diffusion models. Its characterization of the impact of noise scheduling provides a solid foundation for understanding this important aspect of these models.
George Mason University
AI Insights
- The reported cost was 395 USD, and the model was fine-tuned on a smaller dataset. (ML: 0.96)
- The results showed that the model was able to answer questions correctly with high accuracy. (ML: 0.96)
- The authors trained a large language model called LLAMA-3.2-1B-INSTRUCT to answer multiple-choice questions. (ML: 0.95)
- The paper discusses the training and fine-tuning of a large language model, LLAMA-3.2-1B-INSTRUCT, for answering multiple-choice questions. (ML: 0.94)
Abstract
Large language models are typically trained by treating text as a single global distribution, often resulting in geographically homogenized behavior. We study metadata conditioning as a lightweight approach for localization, pre-training 31 models (at 0.5B and 1B parameter scales) from scratch on large-scale English news data annotated with verified URLs, country tags, and continent tags, covering 4 continents and 17 countries. Across four controlled experiments, we show that metadata conditioning consistently improves in-region performance without sacrificing cross-region generalization, enables global models to recover localization comparable to region-specific models, and improves learning efficiency. Our ablation studies demonstrate that URL-level metadata alone captures much of the geographic signal, while balanced regional data coverage remains essential, as metadata cannot fully compensate for missing regions. Finally, we introduce a downstream benchmark of 800 localized news MCQs and show that after instruction tuning, metadata conditioned global models achieve accuracy comparable to LLaMA-3.2-1B-Instruct, despite being trained on substantially less data. Together, these results establish metadata conditioning as a practical and compute-efficient approach for localization of language models.
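Metadata conditioning as described above amounts to prepending location tags to each training document so the model can condition on them. The sketch below shows one plausible tagging scheme; the tag syntax, the example URL, and the field names are illustrative assumptions, not the paper's exact format.

```python
# Illustrative metadata conditioning: prepend URL/country/continent tags
# to a training document. Tag syntax here is an assumption; the paper's
# actual scheme may differ.
def condition(text, url=None, country=None, continent=None):
    tags = [f"<{k}={v}>" for k, v in
            [("url", url), ("country", country), ("continent", continent)]
            if v is not None]
    return " ".join(tags + [text])

doc = condition("Local elections were held today.",
                url="news.example.ke", country="KE", continent="Africa")
print(doc)
# <url=news.example.ke> <country=KE> <continent=Africa> Local elections were held today.
```

The ablation finding above, that URL-level metadata alone captures much of the geographic signal, corresponds to dropping the country and continent tags in such a scheme.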
Why are we recommending this paper?
Due to your interest in Large Language Models
Purdue University
AI Insights
- The authors acknowledge that their study is limited by its reliance on a small number of datasets. (ML: 0.99)
- Domain shift: a phenomenon where the distribution of data in the training set differs from that of the testing set. (ML: 0.98)
- The development of DSCF highlights the need for large-scale and diverse training data. (ML: 0.97)
- Recent works have proposed unsupervised domain adaptation frameworks, but their effectiveness beyond the originally reported datasets is yet to be independently evaluated. (ML: 0.95)
- The benchmarking experiment shows that classifying test samples that are in-distribution with the training dataset is significantly easier than classifying test samples suffering from distribution shift caused by changes in instruments, acquisition conditions, and additional contaminants. (ML: 0.94)
- Foundation model: a pre-trained model that can be fine-tuned for specific tasks, often using transfer learning. (ML: 0.92)
- SANet demonstrated the best overall performance across the datasets. (ML: 0.84)
- The study benchmarks only five architectures and relies on minimal spectral pre-processing. (ML: 0.77)
- Existing open-source Raman datasets are often restricted in size, chemical diversity, or experimental variability. (ML: 0.67)
- Creating large, curated experimental Raman spectral datasets that span multiple instruments, materials, and measurement settings is key to developing a Raman-specific foundation model. (ML: 0.61)
- Raman spectroscopy: a technique used to analyze the vibrational modes of molecules. (ML: 0.52)
Abstract
Deep learning classifiers for Raman spectroscopy are increasingly reported to outperform classical chemometric approaches. However their evaluations are often conducted in isolation or compared against traditional machine learning methods or trivially adapted vision-based architectures that were not originally proposed for Raman spectroscopy. As a result, direct comparisons between existing deep learning models developed specifically for Raman spectral analysis on shared open-source datasets remain scarce. To the best of our knowledge, this study presents one of the first systematic benchmarks comparing three or more published Raman-specific deep learning classifiers across multiple open-source Raman datasets. We evaluate five representative deep learning architectures under a unified training and hyperparameter tuning protocol across three open-source Raman datasets selected to support standard evaluation, fine-tuning, and explicit distribution-shift testing. We report classification accuracies and macro-averaged F1 scores to provide a fair and reproducible comparison of deep learning models for Raman spectra based classification.
Why are we recommending this paper?
Due to your interest in Deep Learning Models
University of Agriculture Faisalabad
AI Insights
- These models have improved accuracy and speed compared to traditional methods. (ML: 0.93)
- LADet is a lightweight and adaptable network for multi-scale object recognition that can handle the problems of scale variation and category imbalance. (ML: 0.85)
- Glossary: YOLOv5 (You Only Look Once v5), a lightweight and efficient object detection model; RetinaNet, a one-stage detector that uses two sub-networks and a backbone design; LADet, a lightweight and adaptable network for multi-scale object recognition; SSD (Single Shot Detector), a single-shot detector that uses multi-reference and multi-scale representation. The development of deep learning models has led to significant advancements in object detection tasks. (ML: 0.77)
- SSD is a single-shot detector that uses multi-reference and multi-scale representation to improve the accuracy of small object detection. (ML: 0.74)
- YOLOv5 is a lightweight and efficient object detection model that can handle various tasks such as person detection, vehicle detection, and pedestrian detection. (ML: 0.70)
- RetinaNet is a one-stage detector that uses two sub-networks and a backbone design to achieve high accuracy and speed in detecting objects at various scales. (ML: 0.68)
Abstract
Object detection in video and image surveillance is a well-established yet rapidly evolving task, strongly influenced by recent deep learning advancements. This review summarises modern techniques by examining architectural innovations, generative model integration, and the use of temporal information to enhance robustness and accuracy. Unlike earlier surveys, it classifies methods based on core architectures, data processing strategies, and surveillance specific challenges such as dynamic environments, occlusions, lighting variations, and real-time requirements. The primary goal is to evaluate the current effectiveness of semantic object detection, while secondary aims include analysing deep learning models and their practical applications. The review covers CNN-based detectors, GAN-assisted approaches, and temporal fusion methods, highlighting how generative models support tasks such as reconstructing missing frames, reducing occlusions, and normalising illumination. It also outlines preprocessing pipelines, feature extraction progress, benchmarking datasets, and comparative evaluations. Finally, emerging trends in low-latency, efficient, and spatiotemporal learning approaches are identified for future research.
Why are we recommending this paper?
Due to your interest in Deep Learning Models
Georgia Institute of Technology
AI Insights
- The results show that larger window sizes enhance test accuracy slightly, but delay LR decay. (ML: 0.91)
- Assumption B.3: The loss function L is bounded below by a scalar L*. (ML: 0.90)
- The ZENITH algorithm is an adaptive learning rate schedule that uses a sliding window mean to compute the step size. (ML: 0.86)
- The convergence result shows that the algorithm converges to a stationary point if the initial learning rate is chosen such that eta_0 <= 2/M, where M is the Lipschitz constant of the gradient. (ML: 0.83)
- ZENITH outperforms the best baselines across a broad range of window sizes spanning two orders of magnitude. (ML: 0.83)
- Assumption B.4: The gradient of the loss function is M-Lipschitz continuous. (ML: 0.83)
- Definition B.1 (ZENITH Algorithm): The ZENITH algorithm is defined as an adaptive learning rate schedule that uses a sliding window mean to compute the step size. (ML: 0.77)
- The algorithm has been theoretically proven to converge to a stationary point under certain assumptions, including smooth non-convex optimization and bounded loss function. (ML: 0.70)
- The effect of the window size on the performance of ZENITH was evaluated using experiments on CIFAR-100. (ML: 0.69)
- The ZENITH algorithm has been theoretically proven to converge to a stationary point under certain assumptions, and its performance was evaluated using experiments on CIFAR-100. (ML: 0.56)
Abstract
Training deep computer vision models requires manual oversight or hyperparameter tuning of the learning rate (LR) schedule. While existing adaptive optimizers schedule the LR automatically, they suffer from computational and memory overhead, incompatibility with regularization, and suboptimal LR choices. In this work, we introduce the ZENITH (Zero-overhead Evolution using Norm-Informed Training History) optimizer, which adapts the LR using the temporal evolution of the gradient norm. Image classification experiments spanning 6 CNN architectures and 6 benchmarks demonstrate that ZENITH achieves higher test accuracy in lower wall-clock time than baselines. It also yielded superior mAP in object detection, keypoint detection, and instance segmentation on MS COCO using the R-CNN family of models. Furthermore, its compatibility with regularization enables even better generalization.
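The insights above say ZENITH adapts the learning rate from a sliding-window mean over gradient-norm history, but the digest does not give the actual update rule. The sketch below is therefore a hypothetical schedule in that spirit only: the "decay when the windowed mean stops decreasing" rule is an invented stand-in, not the paper's algorithm.

```python
# Hypothetical sliding-window, gradient-norm-driven LR schedule in the
# spirit of the summary. The decay rule here is an assumption; ZENITH's
# actual update is defined in the paper, not reproduced in this digest.
from collections import deque

class WindowedLR:
    def __init__(self, lr0, window=5, decay=0.5):
        self.lr, self.decay = lr0, decay
        self.norms = deque(maxlen=window)
        self.prev_mean = float("inf")

    def step(self, grad_norm):
        self.norms.append(grad_norm)
        if len(self.norms) == self.norms.maxlen:
            mean = sum(self.norms) / len(self.norms)
            if mean >= self.prev_mean:   # progress stalled: decay the LR
                self.lr *= self.decay
            self.prev_mean = mean
        return self.lr

sched = WindowedLR(lr0=0.1, window=3)
for g in [1.0, 0.9, 0.8, 0.8, 0.9, 1.0]:  # gradient norms stop improving
    lr = sched.step(g)
print(lr < 0.1)  # True: the schedule decayed after progress stalled
```

The appeal of norm-history schedules, per the abstract, is that they add essentially no memory or compute overhead, since the only state is a short window of scalars.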
Why are we recommending this paper?
Due to your interest in Deep Learning Optimization
Quantinuum Ltd
AI Insights
- L1 Relative Change (L1RC): A measure of the difference between two probability distributions. (ML: 0.98)
- Signal-to-Noise Ratio (SNR): The ratio of the signal power to the noise power in a system. (ML: 0.93)
- However, on Real Pauli data the advantage clearly shifts toward the ML-based models, which outperform all baselines in both median L1 relative change and fraction of improved circuits. (ML: 0.93)
- Deep learning models can learn corrections directly from data gathered during circuit runs, more easily capturing correlations. (ML: 0.88)
- The best performing models are comparable to the best baseline methods on Simulated data (both Pauli and Random). (ML: 0.87)
- L1RC is defined via the L1 norm of the difference between the two distributions. (ML: 0.87)
- The learned mapping from P_noisy and circuit features to P_ideal captures a richer structure that goes beyond coarse depolarization or measurement-error mitigation. (ML: 0.81)
- The PERCEIVER model consistently achieves as good or greater median performance than the baseline mitigation techniques for Pauli circuits. (ML: 0.80)
- The deep learning approaches can generalize across noise regimes, device generations, and circuit families without relying on a predefined noise model. (ML: 0.79)
- The baseline methods retain value as lightweight, interpretable mitigation techniques, particularly for structured, low-depth circuits. (ML: 0.61)
Abstract
We present a systematic investigation of deep learning methods applied to quantum error mitigation of noisy output probability distributions from measured quantum circuits. We compare different architectures, from fully connected neural networks to transformers, and we test different design/training modalities, identifying sequence-to-sequence, attention-based models as the most effective on our datasets. These models consistently produce mitigated distributions that are closer to the ideal outputs when tested on both simulated and real device data obtained from IBM superconducting quantum processing units (QPU) up to five qubits. Across several different circuit depths, our approach outperforms other baseline error mitigation techniques. We perform a series of ablation studies to examine: how different input features (circuit, device properties, noisy output statistics) affect performance; cross-dataset generalization across circuit families; and transfer learning to a different IBM QPU. We observe that generalization performance across similar devices with the same architecture works effectively, without needing to fully retrain models.
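The insights above describe the L1 Relative Change metric only loosely (an L1 norm of the difference between two distributions). One plausible reading, sketched below, is the relative reduction in L1 distance to the ideal distribution achieved by mitigation; the normalization is an assumption based on the summary, not the paper's definition.

```python
# Sketch of an "L1 relative change" style metric: how much mitigation
# reduced the L1 distance to the ideal output distribution, relative to
# the unmitigated distance. The exact normalization is an assumption.
def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def l1_relative_change(p_ideal, p_noisy, p_mitigated):
    base = l1(p_noisy, p_ideal)
    return (base - l1(p_mitigated, p_ideal)) / base if base else 0.0

ideal = [0.5, 0.5, 0.0, 0.0]       # ideal circuit output distribution
noisy = [0.4, 0.4, 0.1, 0.1]       # measured, noisy distribution
mitigated = [0.45, 0.45, 0.05, 0.05]
print(l1_relative_change(ideal, noisy, mitigated))  # 0.5: half the error removed
```

Under this reading, a positive score means mitigation helped, which matches the "fraction of improved circuits" statistic mentioned in the insights.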
Why are we recommending this paper?
Due to your interest in Deep Learning Optimization
William & Mary
Abstract
Unified multimodal models (UMMs) are emerging as strong foundation models that can do both generation and understanding tasks in a single architecture. However, they are typically trained in centralized settings where all training and downstream datasets are gathered in a central server, limiting the deployment in privacy-sensitive and geographically distributed scenarios. In this paper, we present FedUMM, a general federated learning framework for UMMs under non-IID multimodal data with low communication cost. Built on NVIDIA FLARE, FedUMM instantiates federation for a BLIP3o backbone via parameter-efficient fine-tuning: clients train lightweight LoRA adapters while freezing the foundation models, and the server aggregates only adapter updates. We evaluate on VQA v2 and the GenEval compositional generation benchmarks under Dirichlet-controlled heterogeneity with up to 16 clients. Results show slight degradation as client count and heterogeneity increase, while remaining competitive with centralized training. We further analyze computation--communication trade-offs and demonstrate that adapter-only federation reduces per-round communication by over an order of magnitude compared to full fine-tuning, enabling practical federated UMM training. This work provides empirical experience for future research on privacy-preserving federated unified multimodal models.
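Adapter-only federation as described above means clients send just their LoRA adapter tensors and the server averages them, never touching the frozen backbone. The sketch below shows that server-side FedAvg step with scalars standing in for weight tensors; the parameter names are illustrative, not FLARE or BLIP3o identifiers.

```python
# Server-side federated averaging over LoRA adapter parameters only,
# weighted by client dataset size. Names are illustrative placeholders.
def fedavg_adapters(client_adapters, client_sizes):
    total = sum(client_sizes)
    keys = client_adapters[0].keys()
    return {k: sum(adapt[k] * n
                   for adapt, n in zip(client_adapters, client_sizes)) / total
            for k in keys}

c1 = {"lora_A": 1.0, "lora_B": 2.0}   # scalars stand in for weight tensors
c2 = {"lora_A": 3.0, "lora_B": 4.0}
print(fedavg_adapters([c1, c2], client_sizes=[1, 3]))
# {'lora_A': 2.5, 'lora_B': 3.5}
```

Because only the small adapter dictionaries cross the wire each round, per-round communication shrinks by roughly the ratio of adapter to backbone parameter counts, which is the order-of-magnitude saving the abstract reports.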
Why are we recommending this paper?
Due to your interest in Multimodal Learning
Shanghai Jiao Tong University
AI Insights
- The authors use the DataComp dataset as their pre-training source, which contains 1.1 billion image-text pairs. (ML: 0.93)
- The paper proposes a multi-task visual representation learning framework that combines self-supervised learning (SSL) with grounding and depth supervision. (ML: 0.90)
- For grounding supervision, they apply RAM++ and OWLv2-Base sequentially to extract salient entity names and localize the referenced regions. (ML: 0.88)
- Glossary: SSL, self-supervised learning; CLOC, Contrastive Learning of Contextualized Representations; ViT-B/16, Vision Transformer Base model with 16x downsampling; ViT-L/16, Vision Transformer Large model with 16x downsampling. (ML: 0.86)
- They adopt the MiDaS objective for depth supervision, combining a scale- and shift-invariant loss with a multi-scale gradient matching loss. (ML: 0.83)
- They adopt Depth Anything V2 to generate relative depth maps for each image, achieving a throughput of roughly 120K images per GPU-hour. (ML: 0.82)
- The authors use a lightweight Transformer encoder called Prompter to utilize the grounding supervision. (ML: 0.65)
Abstract
Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves "best-of-both-worlds" performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.
Why are we recommending this paper?
Due to your Interest in Multimodal Learning
Mälardalen University
AI Insights - It achieves higher accuracy with fewer parameters and demonstrates superior data efficiency, requiring fewer client updates. (ML: 0.95)
- Pareto frontiers are plots that show the trade-off between competing objectives, such as accuracy and parameter count. (ML: 0.94)
- Federated neural architecture search (FedNAS) is a method for discovering optimal neural network architectures in a distributed manner. (ML: 0.87)
- Its ability to optimize subnets via a genetic algorithm guided by device-specific latency prediction models addresses the trade-off between accuracy and computational efficiency. (ML: 0.86)
- DeepFedNAS is a federated neural architecture search framework that eliminates the need for a predictor pipeline, reducing computational costs by 61x. (ML: 0.82)
- The framework optimizes subnets via a genetic algorithm guided by device-specific latency prediction models, addressing the trade-off between accuracy and computational efficiency. (ML: 0.81)
- DeepMAD is an algorithm for multi-objective optimization inspired by the concept of maximum entropy. (ML: 0.81)
- The DeepFedNAS framework demonstrates significant improvements in federated neural architecture search, including reduced computational costs, higher accuracy with fewer parameters, and superior data efficiency. (ML: 0.80)
- Ablation studies validate the individual contributions of the DeepMAD-inspired fitness components, confirming that constraining the depth-to-width ratio prevents the selection of degenerate architectures. (ML: 0.78)
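The accuracy-versus-parameters Pareto frontiers mentioned in the insights above reduce to a simple dominance filter. A minimal sketch (the model tuples are hypothetical examples, not results from the paper):

```python
def pareto_frontier(models):
    """Return the non-dominated models.

    Each model is an (accuracy, params) tuple; higher accuracy and
    fewer parameters are both better. A model is dominated if another
    model is at least as good on both axes and strictly better on one.
    """
    frontier = []
    for acc, params in models:
        dominated = any(
            (a >= acc and p <= params) and (a > acc or p < params)
            for a, p in models
        )
        if not dominated:
            frontier.append((acc, params))
    return frontier
```

Plotting the frontier then directly visualizes the accuracy/size trade-off each search method achieves.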
Abstract
Federated Neural Architecture Search (FedNAS) aims to automate model design for privacy-preserving Federated Learning (FL) but currently faces two critical bottlenecks: unguided supernet training that yields suboptimal models, and costly multi-hour pipelines for post-training subnet discovery. We introduce DeepFedNAS, a novel, two-phase framework underpinned by a principled, multi-objective fitness function that synthesizes mathematical network design with architectural heuristics. Enabled by a re-engineered supernet, DeepFedNAS introduces Federated Pareto Optimal Supernet Training, which leverages a pre-computed Pareto-optimal cache of high-fitness architectures as an intelligent curriculum to optimize shared supernet weights. Subsequently, its Predictor-Free Search Method eliminates the need for costly accuracy surrogates by utilizing this fitness function as a direct, zero-cost proxy for accuracy, enabling on-demand subnet discovery in mere seconds. DeepFedNAS achieves state-of-the-art accuracy (e.g., up to 1.21% absolute improvement on CIFAR-100), superior parameter and communication efficiency, and a substantial ~61x speedup in total post-training search pipeline time. By reducing the pipeline from over 20 hours to approximately 20 minutes (including initial cache generation) and enabling 20-second individual subnet searches, DeepFedNAS makes hardware-aware FL deployments instantaneous and practical. The complete source code and experimental scripts are available at: https://github.com/bostankhan6/DeepFedNAS
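The predictor-free evolutionary search described in the abstract can be sketched as follows. The fitness terms, the latency model, and the depth-to-width threshold below are invented stand-ins to illustrate the idea of a zero-cost, multi-objective proxy; they are not DeepFedNAS's actual formulation:

```python
import math
import random

MAX_LATENCY_MS = 50.0   # hypothetical per-device latency budget

def fitness(arch):
    """Hypothetical multi-objective fitness used as a zero-cost accuracy
    proxy: reward capacity, penalize modeled latency, and reject
    degenerate depth-to-width ratios (echoing the DeepMAD-inspired
    constraints), all without training an accuracy predictor."""
    depth, width = arch
    if depth / width > 0.5:                  # guard against degenerate shapes
        return float("-inf")
    capacity = depth * math.log(width)       # entropy-style capacity proxy
    latency = 0.1 * depth * width / 64       # stand-in latency model (ms)
    return capacity - max(0.0, latency - MAX_LATENCY_MS)

def evolve(pop_size=20, generations=30, seed=0):
    """Tiny genetic search over (depth, width) subnets, no trained predictor."""
    rng = random.Random(seed)
    pop = [(rng.randint(4, 32), rng.randint(64, 512)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]       # elitist selection
        children = [(max(4, d + rng.randint(-2, 2)),
                     max(64, w + rng.randint(-32, 32))) for d, w in parents]
        pop = parents + children             # next generation
    return max(pop, key=fitness)
```

Because evaluating `fitness` needs no training or inference, each subnet search completes in seconds, which is the property the paper's ~61x pipeline speedup relies on.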
Why are we recommending this paper?
Due to your Interest in Deep Learning Architectures
Interests not found
We did not find any papers matching the interests below.
Try other terms, and consider whether the content exists on arxiv.org.
Help Shape Our Pricing
We're exploring pricing options to make this project sustainable. Take 3 minutes to share what you'd be willing to pay (if anything). Your input guides our future investment.
Share Your Feedback
Help us improve your experience!
This project is in its early stages; your feedback can be pivotal to its future.
Let us know what you think about this week's papers and suggestions!
Give Feedback