University of Göttingen
Abstract
Multi-agent debate (MAD) has demonstrated the ability to augment collective
intelligence by scaling test-time compute and leveraging expertise. Current
frameworks for multi-agent debate are often geared towards tool use, lack
integrated evaluation, or provide limited configurability of agent personas,
response generators, discussion paradigms, and decision protocols. We introduce
MALLM (Multi-Agent Large Language Models), an open-source framework that
enables systematic analysis of MAD components. MALLM offers more than 144
unique configurations of MAD, including (1) agent personas (e.g., Expert,
Personality), (2) response generators (e.g., Critical, Reasoning), (3)
discussion paradigms (e.g., Memory, Relay), and (4) decision protocols (e.g.,
Voting, Consensus). MALLM uses simple configuration files to define a debate.
Furthermore, MALLM can load any textual Huggingface dataset (e.g., MMLU-Pro,
WinoGrande) and provides an evaluation pipeline for easy comparison of MAD
configurations. MALLM is tailored towards researchers and provides a window
into the heart of multi-agent debate, facilitating the understanding of its
components and their interplay.
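As a concrete illustration of the configuration-driven setup described in the abstract, here is a minimal sketch of what a MALLM-style debate config could contain, written as a Python dict. The keys and values are assumptions for illustration that mirror the four component families named above; they are not the framework's actual schema (see the repository for the real config format).

```python
# Hypothetical sketch of a MALLM-style debate configuration, written as a
# Python dict. The keys mirror the four component families named in the
# abstract; they are illustrative, not the framework's actual schema.
debate_config = {
    "dataset": "MMLU-Pro",  # any textual Huggingface dataset (e.g., WinoGrande)
    "agents": [
        {"persona": "Expert", "response_generator": "Reasoning"},
        {"persona": "Personality", "response_generator": "Critical"},
    ],
    "discussion_paradigm": "Memory",   # e.g., Memory or Relay
    "decision_protocol": "Voting",     # e.g., Voting or Consensus
}
```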
AI Insights
- MALLM's open-source repo hosts more than 144 distinct debate setups, letting researchers swap agent personas, response generators, and discussion paradigms with a single config file.
- The evaluation pipeline automatically benchmarks any Huggingface text dataset, such as MMLU-Pro or WinoGrande, across all chosen decision protocols.
- Config files expose fine-grained knobs (repeats, max turns, concurrent API requests, and sample size), enabling reproducible, large-scale experiments.
- "Stay Focused: Problem Drift in Multi-Agent Debate" offers a deep dive into the dynamic debate contexts that MALLM can simulate.
- "Voting or Consensus? Decision-Making in Multi-Agent Debate" compares protocol efficacy and is a must-read for protocol designers.
- MALLM's modular design lets you plug in new agent personas or response generators without touching the core codebase (see the sketch after this list).
- High compute demands and a steep learning curve for the config syntax are the main practical hurdles to keep in mind.
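To make the modularity claim above concrete, the sketch below shows one way a custom persona could be plugged in. The Persona base class and the registry are assumptions made for this illustration; they are not MALLM's actual extension API.

```python
# Hypothetical plug-in sketch: a custom persona registered alongside built-ins.
# The base class and registry are assumed for illustration only and are not
# MALLM's actual extension API.
class Persona:
    """Minimal interface a debate persona might expose."""
    name = "Base"

    def system_prompt(self) -> str:
        raise NotImplementedError


class DomainExpert(Persona):
    """Example custom persona parameterized by a field of expertise."""
    name = "DomainExpert"

    def __init__(self, field: str):
        self.field = field

    def system_prompt(self) -> str:
        return f"You are an expert in {self.field}. Argue from evidence and cite it."


# A registry like this would let config files refer to personas by name.
PERSONA_REGISTRY = {"Expert": DomainExpert}
```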
Tsinghua University
Abstract
Large language models (LLMs), a recent advance in deep learning and machine
intelligence, have demonstrated astonishing capacities and are now considered
among the most promising candidates for artificial general intelligence. With human-like
capabilities, LLMs have been used to simulate humans and serve as AI assistants
across many applications. As a result, great concern has arisen about whether
and under what circumstances LLMs think and behave like real human agents.
Rationality is among the most important concepts in assessing human behavior,
both in thinking (i.e., theoretical rationality) and in taking action (i.e.,
practical rationality). In this work, we propose the first benchmark for
evaluating the omnibus rationality of LLMs, covering a wide range of domains
and LLMs. The benchmark includes an easy-to-use toolkit, extensive experimental
results, and analysis that illuminates where LLMs converge with and diverge
from idealized human rationality. We believe the benchmark can serve as a
foundational tool for both developers and users of LLMs.
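To give a flavor of what an automated rationality check could look like, the snippet below tests whether pairwise choices elicited from a model form a transitive preference order, a classic requirement of practical rationality. The function and data are illustrative and are not taken from the paper's toolkit.

```python
# Toy automated check: do elicited pairwise preferences contain a cycle
# (a > b, b > c, c > a)? A cycle violates transitivity, a basic requirement
# of practically rational choice. Names and data are illustrative only.
from itertools import permutations

def is_transitive(prefers: dict[tuple[str, str], bool], items: list[str]) -> bool:
    """Return True if no preference cycle exists among the given items."""
    for a, b, c in permutations(items, 3):
        if prefers.get((a, b)) and prefers.get((b, c)) and prefers.get((c, a)):
            return False
    return True

# Example: choices elicited from an LLM over three lotteries form a cycle.
choices = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}
print(is_transitive(choices, ["A", "B", "C"]))  # False -> irrational pattern
```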
AI Insights
- Collective rationality lets multiple LLMs collaborate and decide, a new research frontier!
- The benchmark mixes human judgment and automated metrics, showing individual LLMs excel but teams lag (see the sketch after this list).
- Shortfalls appear in decision-making, problem-solving, and reasoning, revealing coordination gaps.
- Improving collective rationality requires better alignment, smarter coordination, and richer metrics.
- Foundational texts like Collective Intelligence: Making a Good Thing Better guide these efforts!
- Critiques highlight heavy human evaluation and limited metrics, urging more objective measures.
- The study urges developers to craft AI that is not only rational alone but also trustworthy in teams.
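The individual-versus-team comparison mentioned above can be illustrated with a simple majority-vote aggregation over mock answers. This is a toy example of how such a comparison can be scored, not the benchmark's actual aggregation procedure.

```python
# Toy comparison of individual vs. majority-vote ("collective") accuracy.
# The answers below are mock data for illustration, not benchmark results.
from collections import Counter

gold = ["B", "A", "C", "D"]
agent_answers = {
    "agent_1": ["B", "A", "C", "D"],  # a strong individual agent
    "agent_2": ["B", "C", "C", "A"],
    "agent_3": ["B", "C", "A", "D"],
}

def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

individual = {name: accuracy(ans, gold) for name, ans in agent_answers.items()}
majority = [Counter(col).most_common(1)[0][0] for col in zip(*agent_answers.values())]
collective = accuracy(majority, gold)

# In this toy data the best individual (1.0) beats the majority vote (0.75),
# mirroring the coordination gap noted in the insights above.
print(individual, collective)
```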