Abstract
The emergence of Superchips represents a significant advancement in
next-generation AI hardware. These Superchips employ a tightly coupled
heterogeneous architecture that integrates the GPU and CPU on the same
package, offering unprecedented computational power. However, there has been
scant research investigating how LLM training can benefit from this new
architecture. In this work, we present the first study of offloading-based
LLM training solutions for Superchips. We observe important differences
between Superchips and the traditional loosely coupled GPU-CPU architecture,
which necessitate revisiting prevailing assumptions about offloading. Based
on these observations, we present SuperOffload, a Superchip-centric
offloading system that utilizes the Hopper GPU, the Grace CPU, and the
NVLink-C2C interconnect simultaneously and more efficiently.
SuperOffload accomplishes this via a combination of techniques, such as
adaptive weight offloading, bucketization repartitioning, Superchip-aware
casting, speculative execution, and a highly optimized Adam optimizer for Grace
CPUs. Our evaluation of SuperOffload on the NVIDIA GH200 demonstrates up to
2.5x higher throughput than state-of-the-art offloading-based systems,
enabling training of models with up to 25B parameters on a single Superchip
while achieving high training throughput. We also extend SuperOffload with
ZeRO-style data parallelism and DeepSpeed-Ulysses sequence parallelism,
enabling training of a 13B model with sequence lengths of up to 1 million
tokens on 8 GH200 Superchips while achieving 55% model FLOPs utilization
(MFU).
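The division of labor the abstract describes, with parameters and gradients on the GPU while the optimizer states and the Adam step live on the Grace CPU, can be illustrated with a toy numpy model. This is a hedged sketch only: the function names (`cpu_adam_step`, `offloaded_step`) are hypothetical and are not SuperOffload's API, and the GPU-to-CPU transfers over NVLink-C2C are simulated here by plain array copies.

```python
import numpy as np

def cpu_adam_step(param, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam update math; in an offloading system this runs on the
    # CPU, where the optimizer states (m, v) permanently reside.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def offloaded_step(gpu_param, gpu_grad, cpu_state):
    # "Offload": a real system would move gradients GPU -> CPU over the
    # interconnect; here both sides are just numpy arrays and we copy.
    cpu_grad = np.asarray(gpu_grad).copy()
    cpu_state["t"] += 1
    new_param, cpu_state["m"], cpu_state["v"] = cpu_adam_step(
        gpu_param, cpu_grad, cpu_state["m"], cpu_state["v"], cpu_state["t"])
    # Updated weights are copied back to the "GPU" side.
    return new_param

# Toy training loop: minimize f(x) = x^2, whose gradient is 2x.
param = np.array([1.0])
state = {"m": np.zeros_like(param), "v": np.zeros_like(param), "t": 0}
for _ in range(200):
    grad = 2.0 * param
    param = offloaded_step(param, grad, state)
```

The point of the sketch is the data placement, not the arithmetic: only gradients cross the (simulated) interconnect each step, while the large optimizer states never leave CPU memory.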
Abstract
We consider a spectrum sharing problem where two users attempt to communicate
over N channels. The Primary User (PU) has prioritized transmissions and its
occupancy on each channel over time can be modeled as a Markov chain. The
Secondary User (SU) must determine which channels are free in each
time-slot and attempt opportunistic transmissions. The goal of the SU is to
maximize its own throughput while simultaneously minimizing collisions with
the PU and satisfying spectrum access constraints. To solve this problem, we
first decouple the multiple-channel problem into N single-channel problems. For
each decoupled problem, we prove that there exists an optimal threshold policy
that depends on the last observed PU occupancy and the freshness of this
occupancy information. Second, we establish the indexability of the decoupled
problems by analyzing the structure of the optimal threshold policy. Using
this structure, we derive a Whittle index-based scheduling policy that
allocates SU transmissions based on the Age of Information (AoI) of the
accessed channels. We also
extend our insights to PU occupancy models that are correlated across channels
and incorporate learning of unknown Markov transition matrices into our
policies. Finally, we provide detailed numerical simulations that demonstrate
the performance gains of our approach.
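The scheduling idea the abstract outlines, ranking channels by an index that depends on the last observed PU occupancy and its age, can be sketched as follows. This is a hedged illustration only: the true Whittle indices come from the paper's decoupled threshold policies, whereas here we use a simple stand-in index (the predicted probability that the channel is free) computed from a two-state Markov chain; all names and parameters are hypothetical.

```python
import numpy as np

def belief_free(p_stay_free, p_become_free, last_obs_free, aoi):
    """Predict P(channel free now) from a 2-state PU occupancy Markov chain,
    given the last observation and its age (AoI = slots since observed)."""
    # Rows: from-free, from-busy; columns: to-free, to-busy.
    P = np.array([[p_stay_free, 1.0 - p_stay_free],
                  [p_become_free, 1.0 - p_become_free]])
    b = np.array([1.0, 0.0]) if last_obs_free else np.array([0.0, 1.0])
    for _ in range(aoi):
        b = b @ P
    return b[0]

def schedule(channels):
    """Pick the channel with the highest stand-in index.
    A real Whittle-index policy would rank channels by their computed
    indices; the belief of being free is used here as a hypothetical proxy."""
    idx = [belief_free(c["p_ff"], c["p_bf"], c["obs_free"], c["aoi"])
           for c in channels]
    return int(np.argmax(idx))

# Example: channel 0 was just seen free; channel 1 was seen busy 5 slots ago.
channels = [
    {"p_ff": 0.9, "p_bf": 0.2, "obs_free": True,  "aoi": 0},
    {"p_ff": 0.9, "p_bf": 0.2, "obs_free": False, "aoi": 5},
]
chosen = schedule(channels)
```

Note how the AoI enters naturally: the older the observation, the more the belief relaxes toward the chain's stationary distribution, which is exactly why the freshness of occupancy information appears in the optimal threshold policy.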