Hi!

Your personalized paper recommendations for 08 to 12 December, 2025.
🎯 Top Personalized Recommendations
Renmin University of Chin
AI Summary
  • The code provided is for the MoLKV (Mixture of Lookup Key-Value Experts) model, specifically in training mode. [3]
  • The MLP class represents a multi-layer perceptron with three fully connected layers: ffn_gate, ffn_up, and ffn_down. [3]
  • The Layer class represents a transformer layer with attention, feed-forward network (FFN), and expert modules. [3]
  • The training hyperparameters are provided in a table at the end of the code. [3]
  • RMSNorm: Root Mean Square Normalization, a type of layer normalization used in the MoLKV model. [3]
  • MoLKV (Mixture of Lookup Key-Value Experts): A model in which each expert is a key-value pair; the input-derived query interacts with the cached key-value experts from the current sequence to produce a context-aware expert output. [2]
  • The pseudocode defines two classes, MLP and Layer, which are used to implement the MoLKV architecture. [1]
Abstract
Recent research has developed several LLM architectures suitable for inference on end-user devices, such as the Mixture of Lookup Experts (MoLE) \parencite{jie_mixture_2025}. A key feature of MoLE is that each token id is associated with a dedicated group of experts. For a given input, only the experts corresponding to the input token id will be activated. Since the communication overhead of loading this small number of activated experts into RAM during inference is negligible, expert parameters can be offloaded to storage, making MoLE suitable for resource-constrained devices. However, MoLE's context-independent expert selection mechanism, based solely on input ids, may limit model performance. To address this, we propose the Mixture of Lookup Key-Value Experts (MoLKV) model. In MoLKV, each expert is structured as a key-value pair. For a given input, the input-derived query interacts with the cached key-value experts from the current sequence, generating a context-aware expert output. This context-aware mechanism alleviates the limitation of MoLE, and experimental results demonstrate that MoLKV achieves significantly lower validation loss in small-scale evaluations.
Why we think this paper is great for you:
This paper explores Mixture of Experts, a key architecture aligning with the user’s interest in LLMs and their efficient deployment, particularly relevant given the focus on large language models.
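To make the mechanism in the abstract concrete, here is a minimal sketch, not the paper's implementation. It assumes that each token id indexes one key vector and one value vector in lookup tables, and that the query at each position attends over the experts of the tokens seen so far with scaled softmax weights. The table names, dimensions, causal restriction, and scaling are all illustrative assumptions.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def context_aware_expert_output(token_ids, queries, key_table, value_table):
    """For each position t, attend from queries[t] over the key-value
    experts looked up for tokens 0..t (hypothetical causal variant)."""
    outputs = []
    for t, q in enumerate(queries):
        ids = token_ids[: t + 1]
        keys = [key_table[i] for i in ids]      # lookup by token id
        values = [value_table[i] for i in ids]  # lookup by token id
        weights = softmax([dot(q, k) / math.sqrt(len(q)) for k in keys])
        dim = len(values[0])
        out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        outputs.append(out)
    return outputs


# Toy lookup tables: one key and one value vector per token id (made-up numbers).
key_table = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
value_table = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}
token_ids = [0, 1, 2]
queries = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

outputs = context_aware_expert_output(token_ids, queries, key_table, value_table)
```

The same token id can yield different outputs depending on what precedes it in the sequence, which is the contrast with MoLE's selection based solely on input ids.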
Tongji University
AI Summary
  • The paper systematizes three approaches to applying large language models (LLMs) in language sciences: prompt-based interaction with general-use models, fine-tuning of open-source models, and extraction of contextualized embeddings for quantitative analysis. [2]
Abstract
Large Language Models (LLMs) are transforming language sciences. However, their widespread deployment currently suffers from methodological fragmentation and a lack of systematic soundness. This study proposes two comprehensive methodological frameworks designed to guide the strategic and responsible application of LLMs in language sciences. The first method-selection framework defines and systematizes three distinct, complementary approaches, each linked to a specific research goal: (1) prompt-based interaction with general-use models for exploratory analysis and hypothesis generation; (2) fine-tuning of open-source models for confirmatory, theory-driven investigation and high-quality data generation; and (3) extraction of contextualized embeddings for further quantitative analysis and probing of model internal mechanisms. We detail the technical implementation and inherent trade-offs of each method, supported by empirical case studies. Based on the method-selection framework, the second systematic framework proposed provides constructed configurations that guide the practical implementation of multi-stage research pipelines based on these approaches. We then conducted a series of empirical experiments to validate our proposed framework, employing retrospective analysis, prospective application, and an expert evaluation survey. By enforcing the strategic alignment of research questions with the appropriate LLM methodology, the frameworks enable a critical paradigm shift in language science research. We believe that this system is fundamental for ensuring reproducibility, facilitating the critical evaluation of LLM mechanisms, and providing the structure necessary to move traditional linguistics from ad-hoc utility to verifiable, robust science.
Why we think this paper is great for you:
Given the user’s interest in Large Language Models, this paper offers a systematic approach to their application, directly addressing the growing need for robust methodologies in language sciences.
National Technical Univer
Abstract
State-of-the-art person re-identification methods achieve impressive accuracy but remain largely opaque, leaving open the question: which high-level semantic attributes do these models actually rely on? We propose MoSAIC-ReID, a Mixture-of-Experts framework that systematically quantifies the importance of pedestrian attributes for re-identification. Our approach uses LoRA-based experts, each linked to a single attribute, and an oracle router that enables controlled attribution analysis. While MoSAIC-ReID achieves competitive performance on Market-1501 and DukeMTMC under the assumption that attribute annotations are available at test time, its primary value lies in providing a large-scale, quantitative study of attribute importance across intrinsic and extrinsic cues. Using generalized linear models, statistical tests, and feature-importance analyses, we reveal which attributes, such as clothing colors and intrinsic characteristics, contribute most strongly, while infrequent cues (e.g. accessories) have limited effect. This work offers a principled framework for interpretable ReID and highlights the requirements for integrating explicit semantic knowledge in practice. Code is available at https://github.com/psaltaath/MoSAIC-ReID
Why we think this paper is great for you:
The use of a Mixture-of-Experts framework to understand semantic attributes aligns with the user’s interest in understanding the inner workings of LLMs.
Deep Learning Architectures
Universidad de Guanajuato
Abstract
This document reports the sequence of practices and methodologies implemented during the Big Data course. It details the workflow beginning with the processing of the Epsilon dataset through group and individual strategies, followed by text analysis and classification with RestMex and movie feature analysis with IMDb. Finally, it describes the technical implementation of a distributed computing cluster with Apache Spark on Linux using Scala.
AI Summary
  • In the big data era, data completeness can be as important as algorithm sophistication. [3]
  • Key terms: Big Data Analytics, Distributed Computing, Scalability, Algorithm Sophistication, Data Completeness. [3]
  • The chronological progression demonstrates that mastering big data requires a systematic approach. [3]
  • The choice between local and distributed architectures is not merely about computational resources, but about the quality and completeness of the data available to the model. [2]
The University of Hongk
Abstract
Recent advances in large language models (LLMs) have given rise to powerful coding agents, making it possible for code assistants to evolve into code engineers. However, existing methods still face significant challenges in achieving high-fidelity document-to-codebase synthesis--such as scientific papers to code--primarily due to a fundamental conflict between information overload and the context bottlenecks of LLMs. In this work, we introduce DeepCode, a fully autonomous framework that fundamentally addresses this challenge through principled information-flow management. By treating repository synthesis as a channel optimization problem, DeepCode seamlessly orchestrates four information operations to maximize task-relevant signals under finite context budgets: source compression via blueprint distillation, structured indexing using stateful code memory, conditional knowledge injection via retrieval-augmented generation, and closed-loop error correction. Extensive evaluations on the PaperBench benchmark demonstrate that DeepCode achieves state-of-the-art performance, decisively outperforming leading commercial agents such as Cursor and Claude Code, and crucially, surpassing PhD-level human experts from top institutes on key reproduction metrics. By systematically transforming paper specifications into production-grade implementations comparable to human expert quality, this work establishes new foundations for autonomous scientific reproduction that can accelerate research evaluation and discovery.
AI Summary
  • DeepCode's performance is significantly better than the best LLM agent baseline, with a 70% relative improvement. [3]
  • LLM: Large Language Model. [3]
  • BasicAgent: A general-purpose agent scaffolding. [3]
  • IterativeAgent: An improved version of BasicAgent. [3]
  • DeepCode outperforms all other baselines, including human experts and state-of-the-art commercial code agents. [3]
  • The paper does not provide a detailed explanation of the algorithm used in DeepCode. [3]
  • The paper presents a framework called DeepCode that is designed to transform machine learning papers into executable code. [3]
  • It uses systematic planning, structured code generation, and automated verification to achieve high performance. [3]
  • Imagine you have a machine learning paper that describes how to build a new AI model. [2]
  • Previous work on code generation has focused on general-purpose agents, but DeepCode's specialized design provides significant advantages over these approaches. [1]
Deep Learning
National University of
Abstract
Accurate forecasting of urban air pollution is essential for protecting public health and guiding mitigation policies. While Deep Learning (DL) and hybrid pipelines dominate recent research, their complexity and limited interpretability hinder operational use. This study investigates whether lightweight additive models -- Facebook Prophet (FBP) and NeuralProphet (NP) -- can deliver competitive forecasts for particulate matter (PM$_{2.5}$, PM$_{10}$) in Beijing, China. Using multi-year pollutant and meteorological data, we applied systematic feature selection (correlation, mutual information, mRMR), leakage-safe scaling, and chronological data splits. Both models were trained with pollutant and precursor regressors, with NP additionally leveraging lagged dependencies. For context, two machine learning baselines (LSTM, LightGBM) and one traditional statistical model (SARIMAX) were also implemented. Performance was evaluated on a 7-day holdout using MAE, RMSE, and $R^2$. Results show that FBP consistently outperformed NP, SARIMAX, and the learning-based baselines, achieving test $R^2$ above 0.94 for both pollutants. These findings demonstrate that interpretable additive models remain competitive with both traditional and complex approaches, offering a practical balance of accuracy, transparency, and ease of deployment.
AI Summary
  • The study also explores the impact of different input features on the performance of the models and finds that using both air quality index and weather data improves the predictive power of the models. [3]
  • AQI: Air Quality Index. [3]
  • MAE: Mean Absolute Error. [3]
  • The study demonstrates the effectiveness of machine learning models in predicting AQIs and highlights the importance of using both air quality index and weather data for improved predictive power. [3]
  • The results of this study can be used to inform policy decisions related to air pollution control and mitigation strategies. [3]
  • The study only evaluates the performance of different models on a single dataset and does not explore the generalizability of the results to other locations or datasets. [3]
  • The authors do not provide any discussion on the limitations of the study, such as the potential impact of data quality issues or the lack of consideration for non-linear relationships between input features. [3]
  • The paper presents a comparative study of various machine learning models for predicting air quality indices (AQIs) in Beijing, China. [2]
  • The results show that the Prophet model outperforms other models in terms of accuracy, with a mean absolute error (MAE) of 4.35 μg/m³. [1]
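The evaluation protocol in the abstract (chronological split, leakage-safe scaling, and MAE, RMSE, and R² on a holdout) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the series is made-up data, the holdout is 3 points rather than 7 days, and a naive last-value forecast stands in for Prophet or NeuralProphet.

```python
import math


def chronological_split(series, holdout):
    """Keep time order: the last `holdout` points form the test set."""
    return series[:-holdout], series[-holdout:]


def fit_standard_scaler(train):
    """Estimate mean and std on the training window only (no test leakage)."""
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    std = math.sqrt(var) or 1.0  # guard against a constant series
    return mean, std


def transform(values, mean, std):
    return [(x - mean) / std for x in values]


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical daily pollutant readings, oldest first.
series = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0]
train, test = chronological_split(series, holdout=3)

# Scaler statistics come from the training window and are reused on the test set.
mean, std = fit_standard_scaler(train)
test_scaled = transform(test, mean, std)

# Placeholder model: repeat the last training value over the holdout.
pred = [train[-1]] * len(test)
scores = {"MAE": mae(test, pred), "RMSE": rmse(test, pred), "R2": r2(test, pred)}
```

Fitting the scaler on the full series instead would let holdout statistics influence the training inputs, which is the leakage the abstract says it avoids.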
Multimodal Learning
Peking University
Abstract
Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
AI Summary
  • The paper presents a novel visuotactile sensor called TacThru that combines visual and tactile information to enable robots to perform complex manipulation tasks. [3]
  • TacThru uses a combination of computer vision and machine learning algorithms to process the visual and tactile data, allowing it to detect objects, track their movement, and adjust its grasp accordingly. [3]
  • The authors demonstrate the effectiveness of TacThru in various scenarios, including grasping and manipulating small objects, opening doors, and even playing a game of Jenga. [3]
  • TacThru's performance is compared to other state-of-the-art visuotactile sensors, showing that it outperforms them in terms of accuracy and robustness. [3]
  • The authors also discuss the limitations of TacThru and potential future directions for improvement. [3]
  • Visuotactile sensor: A sensor that combines visual and tactile information to enable robots to perform complex manipulation tasks. [3]
  • Machine learning algorithms: Techniques used by computers to learn from data without being explicitly programmed. [3]
  • Computer vision: The ability of computers to interpret and understand visual data from images or videos. [2]
Shandong Normal University
Abstract
Cross-modal learning has become a fundamental paradigm for integrating heterogeneous information sources such as images, text, and structured attributes. However, multimodal representations often suffer from modality dominance, redundant information coupling, and spurious cross-modal correlations, leading to suboptimal generalization and limited interpretability. In particular, high-variance modalities tend to overshadow weaker but semantically important signals, while naïve fusion strategies entangle modality-shared and modality-specific factors in an uncontrolled manner. This makes it difficult to understand which modality actually drives a prediction and to maintain robustness when some modalities are noisy or missing. To address these challenges, we propose a Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net), a simple yet effective framework that disentangles modality-specific and modality-shared information through residual decomposition and explicit semantic decorrelation constraints. DSRSD-Net introduces: (1) a dual-stream representation learning module that separates intra-modal (private) and inter-modal (shared) latent factors via residual projection; (2) a residual semantic alignment head that maps shared factors from different modalities into a common space using a combination of contrastive and regression-style objectives; and (3) a decorrelation and orthogonality loss that regularizes the covariance structure of the shared space while enforcing orthogonality between shared and private streams, thereby suppressing cross-modal redundancy and preventing feature collapse. Experimental results on two large-scale educational benchmarks demonstrate that DSRSD-Net consistently improves next-step prediction and final outcome prediction over strong single-modality, early-fusion, late-fusion, and co-attention baselines.
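One common way to write a decorrelation and orthogonality penalty like the one in item (3) is sketched below. The abstract does not give the exact formulas, so both definitions are assumptions: decorrelation is taken as the sum of squared off-diagonal covariance entries of the shared features, and orthogonality as the squared Frobenius norm of the batch-averaged cross-product between shared and private features. Function names and toy data are hypothetical.

```python
def covariance(rows):
    """Sample covariance matrix of a batch given as a list of feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def decorrelation_loss(shared):
    """Penalize correlation between different shared dimensions
    (assumed form: sum of squared off-diagonal covariance entries)."""
    cov = covariance(shared)
    d = len(cov)
    return sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j)


def orthogonality_loss(shared, private):
    """Penalize overlap between shared and private streams
    (assumed form: squared Frobenius norm of shared^T private / n)."""
    n = len(shared)
    total = 0.0
    for i in range(len(shared[0])):
        for j in range(len(private[0])):
            c = sum(s[i] * p[j] for s, p in zip(shared, private)) / n
            total += c * c
    return total


# Toy batch of 4 samples with 2-dimensional features (made-up numbers).
shared = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
private = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
correlated = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]

loss_decorrelated = decorrelation_loss(shared)         # dimensions uncorrelated
loss_correlated = decorrelation_loss(correlated)       # dimensions fully correlated
loss_orthogonal = orthogonality_loss(shared, private)  # streams do not overlap
loss_overlap = orthogonality_loss(shared, shared)      # stream compared to itself
```

Both terms are zero when the constraint holds and grow as the shared dimensions become redundant or the two streams start encoding the same directions.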
Deep Learning Optimization
University College London
Abstract
\citet{farrell2021deep} establish non-asymptotic high-probability bounds for general deep feedforward neural network (with rectified linear unit activation function) estimators, with \citet[Theorem 1]{farrell2021deep} achieving a suboptimal convergence rate for fully connected feedforward networks. The authors suggest that improved approximation of fully connected networks could yield sharper versions of \citet[Theorem 1]{farrell2021deep} without altering the theoretical framework. By deriving approximation bounds specifically for a narrower fully connected deep neural network, this note demonstrates that \citet[Theorem 1]{farrell2021deep} can be improved to achieve an optimal rate (up to a logarithmic factor). Furthermore, this note briefly shows that deep neural network estimators can mitigate the curse of dimensionality for functions with compositional structure and functions defined on manifolds.
Large Language Models
Adobe Research
Abstract
We introduce a new paradigm for building large causal models (LCMs) that exploits the enormous potential latent in today's large language models (LLMs). We describe our ongoing experiments with an implemented system called DEMOCRITUS (Decentralized Extraction of Manifold Ontologies of Causal Relations Integrating Topos Universal Slices) aimed at building, organizing, and visualizing LCMs that span disparate domains extracted from carefully targeted textual queries to LLMs. DEMOCRITUS is methodologically distinct from traditional narrow-domain and hypothesis-centered causal inference that builds causal models from experiments that produce numerical data. A high-quality LLM is used to propose topics, generate causal questions, and extract plausible causal statements from a diverse range of domains. The technical challenge is then to take these isolated, fragmented, potentially ambiguous and possibly conflicting causal claims, and weave them into a coherent whole, converting them into relational causal triples and embedding them into an LCM. Addressing this technical challenge required inventing new categorical machine learning methods, which we can only briefly summarize in this paper, as it is focused more on the systems side of building DEMOCRITUS. We describe the implementation pipeline for DEMOCRITUS, comprising six modules, and examine its computational cost profile to determine where the current bottlenecks lie in scaling the system to larger models. We describe the results of using DEMOCRITUS over a wide range of domains, spanning archaeology, biology, climate change, economics, medicine and technology. We discuss the limitations of the current DEMOCRITUS system, and outline directions for extending its capabilities.
AI Summary
  • DEMOCRITUS is a large-scale 'Causal Observatory' that constructs LCMs directly from natural-language causal statements. [3]
  • Naive BFS treats all branches equally and does not take into account the structural feedback from the topic graph, causal triples, and GT embeddings. [3]
  • DEMOCRITUS graphs are sparse (edge count scales approximately linearly with node count) and exhibit a strongly skewed, heavy-tailed degree distribution: a small number of variables act as hubs with degree in the hundreds, while the majority of nodes have degree close to one. [3]
  • Key terms: Causal Observatory, Large Causal Model (LCM), Diagrammatic Backpropagation, Geometric Transformer, Simplicial Causal Complex. [3]
  • DEMOCRITUS is a powerful tool for constructing LCMs from natural-language causal statements. [3]
  • The system uses a high-quality LLM (Qwen3-Next-80B-A3B-Instruct-6bit) in Modules 1–3; the Geometric Transformers and UMAP (Module 5) are comparatively cheap. [2]
Diffusion Models
University of Modena and
Abstract
Diffusion Denoising models demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and only match them at best. In this work, we propose a different paradigm by adopting diffusion models to provide suggestions to the autoregressive generation rather than replacing them. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO, among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming both autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: https://github.com/jchenghu/show_suggest_tell.
AI Summary
  • The proposed method is evaluated on several benchmark datasets, including MSCOCO and Flickr30k. [3]
  • Diffusion model: A type of generative model that learns to transform a noise signal into a data distribution. [3]
  • Transformer architecture: A type of neural network architecture that is particularly well-suited for sequence-to-sequence tasks such as machine translation and image captioning. [3]
  • The paper discusses the application of diffusion models in image captioning, a task that involves generating natural language captions for images. [2]
NOVA University of Lisbon
Abstract
Conditional diffusion models rely on language-to-image alignment methods to steer the generation towards semantically accurate outputs. Despite the success of this architecture, misalignment and hallucinations remain common issues and require automatic misalignment detection tools to improve quality, for example by applying them in a Best-of-N (BoN) post-generation setting. Unfortunately, measuring the alignment after the generation is an expensive step since we need to wait for the overall generation to finish to determine prompt adherence. In contrast, this work hypothesizes that text/image misalignments can be detected early in the denoising process, enabling real-time alignment assessment without waiting for the complete generation. In particular, we propose NoisyCLIP, a method that measures semantic alignment in the noisy latent space. This work is the first to explore and benchmark prompt-to-latent misalignment detection during image generation using dual encoders in the reverse diffusion process. We evaluate NoisyCLIP qualitatively and quantitatively and find it reduces computational cost by 50% while achieving 98% of CLIP alignment performance in BoN settings. This approach enables real-time alignment assessment during generation, reducing costs without sacrificing semantic fidelity.

Interests not found

We did not find any papers that match the interests below. Try other terms, and also consider whether the content exists on arxiv.org.
  • Deep Learning Models
You can edit or add more interests any time.